IQN Routing: Integrating Quality and Novelty in P2P Querying and Ranking

  • Sebastian Michel
  • Matthias Bender
  • Peter Triantafillou
  • Gerhard Weikum
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3896)

Abstract

We consider a collaboration of peers that autonomously crawl the Web. A pivotal issue when designing a peer-to-peer (P2P) Web search engine in this environment is query routing: selecting a small subset of the potentially very large number of relevant peers to contact in order to satisfy a keyword query. Existing approaches to query routing work well on disjoint data sets. In practice, however, the peers’ data collections often overlap heavily, because popular documents are crawled by many peers. Techniques for estimating the cardinality of the overlap between sets that are designed for, and incorporated into, information retrieval engines are largely lacking. In this paper we present a comprehensive evaluation of appropriate overlap estimators and show how they can be incorporated into an efficient, iterative approach to query routing, coined Integrated Quality Novelty (IQN). We further propose to enhance our approach using histograms, combining overlap estimation with the available score/ranking information. Finally, we conduct a performance evaluation in MINERVA, our prototype P2P Web search engine.
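
The idea of routing iteratively on both quality and novelty can be illustrated with a minimal sketch. It is not the paper's algorithm: the function name `route_query`, the multiplicative score (quality times number of not-yet-covered documents), and the use of exact document-ID sets are all assumptions made for illustration. In a real P2P setting an overlap estimator would stand in for the exact set difference.

```python
from typing import Dict, List, Set


def route_query(peers: Dict[str, Set[str]],
                quality: Dict[str, float],
                max_peers: int) -> List[str]:
    """Greedily pick peers by quality * novelty (hypothetical scoring rule)."""
    covered: Set[str] = set()   # documents expected from the peers chosen so far
    chosen: List[str] = []
    candidates = set(peers)

    def score(p: str) -> float:
        # Novelty: number of documents the peer adds beyond those already covered.
        # Exact sets are an assumption; an overlap estimate would replace this.
        return quality[p] * len(peers[p] - covered)

    while candidates and len(chosen) < max_peers:
        # Sorting makes tie-breaking deterministic (alphabetical by peer name).
        best = max(sorted(candidates), key=score)
        if score(best) == 0:
            break               # no remaining peer contributes new documents
        chosen.append(best)
        covered |= peers[best]  # update the reference set before the next round
        candidates.remove(best)
    return chosen


# Toy data: peer B has high quality but is fully contained in peer A.
peers = {
    "A": {"d1", "d2", "d3", "d4"},
    "B": {"d1", "d2", "d3"},
    "C": {"d5", "d6"},
}
quality = {"A": 1.0, "B": 0.9, "C": 0.5}
selected = route_query(peers, quality, max_peers=2)
print(selected)
```

On this toy input a ranking by quality alone would contact A and B, although B contributes nothing new once A has been chosen. The sketch instead selects A and then C, which is the effect that overlap awareness is meant to achieve.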

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Sebastian Michel (1)
  • Matthias Bender (1)
  • Peter Triantafillou (2)
  • Gerhard Weikum (1)

  1. Max-Planck-Institut für Informatik
  2. RACTI and University of Patras
