Informatik - Forschung und Entwicklung, Volume 20, Issue 3, pp 152–166

Das MINERVA-Projekt: Datenbankselektion für Peer-to-Peer-Websuche (The MINERVA Project: Database Selection for Peer-to-Peer Web Search)

  • Matthias Bender
  • Sebastian Michel
  • Gerhard Weikum
  • Christian Zimmer
Original Article

Summary

This article presents MINERVA, a prototype implementation of a distributed search engine based on a peer-to-peer (P2P) architecture. MINERVA builds on distributed hash tables, a technique widely used in the P2P world, and uses them to construct a distributed directory. Peers in our approach correspond to fully autonomous users with their own local search capabilities who are willing to make their local knowledge and local search capabilities available as part of a collaboration. We formalize our system architecture and describe the central problem of efficiently finding promising peers for a given query within the federation. To this end, we draw on existing methods and adapt them to our system context. We present experiments on real data that compare several of these approaches. These experiments show that the quality of the approaches varies, which underscores the importance and impact of a powerful method for selecting good databases. Our experiments indicate that a small number of carefully selected databases typically already delivers a large share of all relevant results of the overall system.

Abstract

This paper presents the MINERVA project, which prototypes a distributed search engine based on P2P techniques. MINERVA is layered on top of a Chord-style overlay network and uses a powerful crawling, indexing, and search engine on every autonomous peer. We formalize our system model and identify the problem of efficiently selecting promising peers for a query as a pivotal issue. We revisit existing approaches to the database selection problem and adapt them to our system environment. Measurements are performed to compare different selection strategies using real-world data. The experiments show significant performance differences between the strategies and demonstrate the importance of a judicious peer selection strategy. The experiments also provide initial evidence that a small number of carefully selected peers already provides the vast majority of all relevant results.
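
As a rough illustration of the two ideas named in the abstract, a distributed directory on a Chord-style ring and the selection of promising peers for a query, the following Python sketch may help. It is not MINERVA's implementation: the class `ToyDirectory`, the functions `ring_position` and `select_peers`, the in-memory storage, and the scoring rule (summing per-term document frequencies) are all assumptions made for brevity. The article compares several selection strategies, which this sketch does not reproduce.

```python
import hashlib
from bisect import bisect_left
from collections import defaultdict


def ring_position(key: str, bits: int = 32) -> int:
    """Map a string onto a hash ring of size 2**bits (consistent hashing)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)


class ToyDirectory:
    """Hypothetical term-partitioned directory kept in one process.

    Each term is assigned to the peer that succeeds the term's hash on the
    ring; that peer stores the statistics other peers post for the term.
    """

    def __init__(self, peer_ids):
        # Peers sorted by ring position, as on a Chord-style ring.
        self.ring = sorted((ring_position(p), p) for p in peer_ids)
        self.positions = [pos for pos, _ in self.ring]
        # store[directory_peer][term][posting_peer] = document frequency
        self.store = defaultdict(lambda: defaultdict(dict))

    def responsible_peer(self, term: str) -> str:
        """Return the first peer at or after the term's position, with wrap-around."""
        idx = bisect_left(self.positions, ring_position(term))
        return self.ring[idx % len(self.ring)][1]

    def post(self, peer_id: str, term: str, doc_freq: int) -> None:
        """A peer publishes how many of its local documents contain the term."""
        self.store[self.responsible_peer(term)][term][peer_id] = doc_freq

    def peer_list(self, term: str) -> dict:
        """Fetch all statistics posted for a term from its directory peer."""
        return dict(self.store[self.responsible_peer(term)].get(term, {}))


def select_peers(directory: ToyDirectory, query_terms, k: int):
    """Rank peers by summed document frequency over the query terms (assumed rule)."""
    scores = defaultdict(int)
    for term in query_terms:
        for peer_id, doc_freq in directory.peer_list(term).items():
            scores[peer_id] += doc_freq
    # Highest score first; ties broken by peer id for a deterministic order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [peer_id for peer_id, _ in ranked[:k]]


if __name__ == "__main__":
    directory = ToyDirectory(["p1", "p2", "p3"])
    directory.post("p1", "web", 10)
    directory.post("p2", "web", 3)
    directory.post("p3", "search", 7)
    directory.post("p1", "search", 1)
    print(select_peers(directory, ["web", "search"], k=2))
```

In this toy setup, a query for "web search" with k=2 would be forwarded only to the two highest-scoring peers rather than to all three, which mirrors the abstract's point that a few well-chosen peers may already cover most relevant results.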

Keywords

Peer-to-Peer, Query Routing, Web Search

Copyright information

© Springer-Verlag 2005

Authors and Affiliations

  • Matthias Bender (1)
  • Sebastian Michel (1)
  • Gerhard Weikum (1)
  • Christian Zimmer (1)
  1. Max-Planck-Institut für Informatik, Saarbrücken, Germany
