Aggregation of Document Frequencies in Unstructured P2P Networks

  • Robert Neumayer
  • Christos Doulkeridis
  • Kjetil Nørvåg
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5802)


Peer-to-peer (P2P) systems have recently been proposed for providing search and information retrieval facilities over distributed data sources, including web data. Terms and their document frequencies are the main building blocks of retrieval, and as such they need to be computed, aggregated, and distributed throughout the system. This is a difficult task, because skewed document distributions mean that the local view of each peer may not reflect the global document collection. Moreover, assembling the complete information centrally is not feasible, both because of the prohibitive cost of storage and maintenance and because of issues related to digital rights management. In this paper, we propose an efficient approach for aggregating the document frequencies of carefully selected terms, based on a hierarchical overlay network. To this end, we examine unsupervised feature selection techniques at the individual peer level, in order to identify a limited set of the most important terms for aggregation. We provide a theoretical analysis of the cost of our approach, and we conduct experiments on two document collections to measure the quality of the aggregated document frequencies.
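
The idea of selecting terms at each peer and summing their document frequencies up a hierarchy can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the method of the paper: the selection rule (keep the k terms with the highest local document frequency), the way peers are grouped (consecutive chunks of a list standing in for a hierarchical overlay), and all function names and parameters (`select_terms`, `aggregate`, `k`, `fanout`).

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def local_document_frequencies(docs: Iterable[Set[str]]) -> Counter:
    """Count, for each term, how many of a peer's documents contain it."""
    df: Counter = Counter()
    for terms in docs:
        df.update(set(terms))  # each document counts a term at most once
    return df


def select_terms(df: Counter, k: int) -> Dict[str, int]:
    """Hypothetical selection rule (an assumption): keep the k terms
    with the highest local document frequency."""
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:k])


def merge(summaries: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Sum the document frequencies reported by several children."""
    total: Counter = Counter()
    for summary in summaries:
        total.update(summary)  # Counter.update adds counts from a mapping
    return dict(total)


def aggregate(peers: List[List[Set[str]]], k: int, fanout: int = 2) -> Dict[str, int]:
    """Aggregate selected document frequencies level by level.

    Grouping consecutive summaries into chunks of size `fanout` is only a
    stand-in for a real hierarchical overlay network.
    """
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = [select_terms(local_document_frequencies(docs), k) for docs in peers]
    if not level:
        return {}
    while len(level) > 1:
        level = [merge(level[i:i + fanout]) for i in range(0, len(level), fanout)]
    return level[0]
```

Because each peer reports only its selected terms, the aggregated frequency of a term can be lower than its true global document frequency. This is the kind of quality loss that an evaluation of aggregated document frequencies would have to measure.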


Keywords: Feature Selection, Information Retrieval, Digital Library, Feature Selection Method, Success Ratio
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Robert Neumayer (1)
  • Christos Doulkeridis (1)
  • Kjetil Nørvåg (1)
  1. Norwegian University of Science and Technology, Trondheim, Norway
