The VLDB Journal, Volume 17, Issue 6, pp 1371–1384

PicShark: mitigating metadata scarcity through large-scale P2P collaboration

  • Philippe Cudré-Mauroux
  • Adriana Budura
  • Manfred Hauswirth
  • Karl Aberer
Special Issue Paper


With the commoditization of digital devices, personal information and media sharing is becoming a key application on the pervasive Web. In such a context, data annotation rather than data production is the main bottleneck. Metadata scarcity represents a major obstacle preventing efficient information processing in large and heterogeneous communities. However, social communities also open the door to new possibilities for addressing local metadata scarcity by taking advantage of global collections of resources. We propose to tackle the lack of metadata in large-scale distributed systems through a collaborative process leveraging both content and metadata. We develop a community-based and self-organizing system called PicShark in which information entropy, in terms of missing metadata, is gradually alleviated through decentralized instance and schema matching. Our approach focuses on semi-structured metadata and confines computationally expensive operations to the edge of the network, while keeping distributed operations as simple as possible to ensure scalability. PicShark builds on structured Peer-to-Peer networks for distributed look-up operations, but extends the application of self-organization principles to the propagation of metadata and the creation of schema mappings. We demonstrate the practical applicability of our method in an image sharing scenario and provide experimental evidence illustrating the validity of our approach.


Metadata scarcity · Metadata heterogeneity · Metadata entropy · Peer-to-Peer collaboration · Peer data management





Copyright information

© Springer-Verlag 2008

Authors and Affiliations

  • Philippe Cudré-Mauroux (1)
  • Adriana Budura (1)
  • Manfred Hauswirth (2)
  • Karl Aberer (1)
  1. School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland
  2. Digital Enterprise Research Institute, National University of Ireland, Galway, Ireland
