Structural Properties as Proxy for Semantic Relevance in RDF Graph Sampling

  • Laurens Rietveld
  • Rinke Hoekstra
  • Stefan Schlobach
  • Christophe Guéret
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8797)


The Linked Data cloud has grown to become the largest knowledge base ever constructed. Its size is now turning into a major bottleneck for many applications. To facilitate access to this structured information, this paper proposes an automatic sampling method targeted at maximizing answer coverage for applications using SPARQL querying. The approach is novel: no similar RDF sampling approach exists, and the concept of creating a sample aimed at maximizing SPARQL answer coverage is unique. We empirically show that the relevance of triples for sampling (a semantic notion) is influenced by the topology of the graph (a purely structural notion) and can be determined without prior knowledge of the queries. Experiments show a significantly higher recall for topology-based sampling methods than for random and naive baseline approaches (e.g., up to 90% for Open-BioMed at a sample size of 6%).
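The idea in the abstract can be illustrated with a toy sketch. It is not the authors' method: the abstract does not say which topological measures are used, so ranking by the in-degree of a triple's object node is an assumption made here, as are the function names, the single-pattern stand-in for SPARQL queries, and the reading of recall as the share of full-graph answers still found in the sample.

```python
from collections import Counter

def sample_by_in_degree(triples, fraction):
    """Keep the top `fraction` of triples, ranked by the in-degree of the object node.
    In-degree is an assumed structural score, chosen only for illustration."""
    in_degree = Counter(o for _, _, o in triples)
    ranked = sorted(triples, key=lambda t: in_degree[t[2]], reverse=True)
    k = max(1, round(len(triples) * fraction))
    return set(ranked[:k])

def answer(pattern, triples):
    """Match one triple pattern against the triples; None acts as a variable."""
    return {t for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))}

def recall(queries, full, sample):
    """Share of the answers over the full graph that the sample still returns."""
    total = found = 0
    for q in queries:
        expected = answer(q, full)
        total += len(expected)
        found += len(expected & answer(q, sample))
    return found / total if total else 1.0

# Hypothetical toy graph: "hub" is a highly connected node.
triples = [
    ("a", "knows", "hub"),
    ("b", "knows", "hub"),
    ("c", "knows", "hub"),
    ("d", "knows", "e"),
    ("e", "likes", "f"),
]
# The queries are used only for evaluation, never for building the sample.
queries = [(None, "knows", "hub"), (None, "likes", None)]

sample = sample_by_in_degree(triples, 0.6)
print(recall(queries, triples, sample))  # 0.75
```

The sample is built from structure alone, and the queries enter only when recall is measured, which mirrors the claim that relevance can be estimated without prior knowledge of the queries.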


Keywords: subgraphs, sampling, graph analysis, ranking, Linked Data





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Laurens Rietveld (1)
  • Rinke Hoekstra (1, 2)
  • Stefan Schlobach (1)
  • Christophe Guéret (3)
  1. Dept. of Computer Science, VU University Amsterdam, The Netherlands
  2. Leibniz Center for Law, University of Amsterdam, The Netherlands
  3. Data Archiving and Network Services (DANS), The Netherlands
