Analysis of schema structures in the Linked Open Data graph based on unique subject URIs, pay-level domains, and vocabulary usage

Distributed and Parallel Databases

Abstract

The Linked Open Data (LOD) graph is a web-scale distributed knowledge graph that interlinks information about entities across various domains. A core concept is the absence of a pre-defined schema, which allows data from all kinds of domains to be modelled flexibly. However, Linked Data does exhibit schema information in two ways: explicitly, by attaching RDF types to entities, and implicitly, by using domain-specific properties to describe them. In this paper, we present and apply different techniques for investigating the schematic information encoded in the LOD graph at different levels of granularity. We investigate different information-theoretic properties of so-called Unique Subject URIs (USUs) and measure the correlation between the properties and types that can be observed for USUs on a large-scale semantic graph data set. Our analysis provides insights into the information encoded in the different schema characteristics. Two major findings are that implicit schema information is far more discriminative than explicit schema information, and that applications relying on schema information based on either types or properties alone will capture only between 63.5% and 88.1% of the schema information contained in the data. As the level of discrimination depends on how data providers model and publish their data, we conducted, in a second step, an investigation based on pay-level domains (PLDs) as well as on the semantic level of vocabularies. Overall, we observe that most data providers combine up to 10 vocabularies to model their data and that every fifth PLD uses a highly structured schema.
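
To make the idea of measuring explicit (type-based) and implicit (property-based) schema information more concrete, the following is a minimal sketch in Python. It is not the authors' implementation: the toy triples, the `ex:` identifiers, the function names, and the decision to exclude `rdf:type` from the property sets are all assumptions made for illustration. The sketch groups triples by subject URI, derives a type set and a property set per subject, and computes the entropy of each, their joint entropy, and their mutual information.

```python
from collections import Counter, defaultdict
from math import log2

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical toy triples (subject, predicate, object); not data from the paper.
TRIPLES = [
    ("ex:alice", RDF_TYPE, "foaf:Person"),
    ("ex:alice", "foaf:name", '"Alice"'),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", RDF_TYPE, "foaf:Person"),
    ("ex:bob", "foaf:name", '"Bob"'),
    ("ex:doc1", RDF_TYPE, "foaf:Document"),
    ("ex:doc1", "dc:title", '"Report"'),
    ("ex:doc2", "dc:title", '"Untyped note"'),
]


def schema_signatures(triples):
    """Map each subject URI to its (type set, property set) pair."""
    types, props = defaultdict(set), defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            types[subject].add(obj)
        else:
            # Assumption: rdf:type itself is not counted as a property.
            props[subject].add(predicate)
    subjects = set(types) | set(props)
    return {s: (frozenset(types[s]), frozenset(props[s])) for s in subjects}


def entropy(counts):
    """Shannon entropy in bits of the empirical distribution given by counts."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())


def schema_statistics(triples):
    """Entropy of type sets, property sets, their joint, and mutual information."""
    signatures = list(schema_signatures(triples).values())
    h_t = entropy(Counter(t for t, _ in signatures))
    h_r = entropy(Counter(r for _, r in signatures))
    h_tr = entropy(Counter(signatures))
    return {"H_T": h_t, "H_R": h_r, "H_TR": h_tr, "I_TR": h_t + h_r - h_tr}


if __name__ == "__main__":
    for name, value in schema_statistics(TRIPLES).items():
        print(f"{name} = {value:.3f}")
```

On this toy input, the type sets and the property sets each have an entropy of 1.5 bits, the joint entropy is 2.0 bits, and the mutual information is 1.0 bit, so either view alone accounts for 75% of the joint entropy. Reading the percentage range quoted in the abstract as such a ratio is an interpretation for illustration only; the paper itself defines the exact measures.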

Notes

  1. http://www.w3.org/DesignIssues/LinkedData.html, accessed: 23 March, 2013.

  2. http://www.w3.org/TR/rdf-syntax-grammar/, accessed: 23 March, 2013.

  3. http://www.lod-cloud.net/, accessed: 23 March, 2013.

  4. Taken from the famous Friend-of-a-Friend (FOAF) vocabulary for describing people and their relations. See: http://www.foaf-project.org/, accessed: 23 March, 2013.

  5. http://challenge.semanticweb.org/, accessed: 23 March, 2013.

  6. Please note that we use the letter \(r\) for sets of properties (inspired by the term relation), as \(p\) will be used to denote probabilities.

  7. Data sources are, e.g., static RDF documents and SPARQL endpoints [17].

  8. BTC 2012 data set: http://km.aifb.kit.edu/projects/btc-2012/, accessed: 25 March, 2013.

  9. Please note that the efficient stream-based approach can also increase the number of type sets and property sets. This is because a single missed type can lead to the deduction of a new type set that does not actually occur in the lossless gold standard.

  10. This might not entirely be the case in semi-automated extraction of Linked Data (e.g., DBpedia) or when using crowdsourcing for the creation of Linked Data (e.g., Freebase).

  11. http://www.w3.org/TR/2013/CR-vocab-data-cube-20130625/, accessed: 11 September, 2013.

  12. https://github.com/cygri/make-void, accessed: 21 March, 2013.

  13. http://rdfstats.sourceforge.net/, accessed: 21 March, 2013.

  14. http://www.webdatacommons.org/, accessed: 21 March, 2013.

  15. http://webdatacommons.org/2012-08/stats/how_to_get_the_data.html#toc2, accessed: 21 March, 2013.

  16. http://webdatacommons.org/, accessed: 21 March, 2013.

  17. http://lod-cloud.net/state/, accessed: 21 March, 2013.

References

  1. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th International Conference on Very Large Data Bases, VLDB’94, pp. 487–499. Morgan Kaufmann Publishers Inc., San Francisco (1994). http://dl.acm.org/citation.cfm?id=645920.672836

  2. Alexander, K., Cyganiak, R., Hausenblas, M., Zhao, J.: Describing Linked Datasets with the VoID Vocabulary. http://www.w3.org/TR/void/. Accessed 9 Mar 2013

  3. Auer, S., Demter, J., Martin, M., Lehmann, J.: Lodstats—an extensible framework for high-performance dataset analytics. In: Teije, A., Völker, J., Handschuh, S., Stuckenschmidt, H., d’Acquin, M., Nikolov, A., Aussenac-Gilles, N., Hernandez, N. (eds.) Knowledge Engineering and Knowledge Management, Lecture Notes in Computer Science, vol. 7603, pp. 353–362. Springer, Berlin (2012). doi:10.1007/978-3-642-33876-2_31

  4. Bizer, C.: The emerging web of linked data. IEEE Intell. Syst. 24(5), 87–92 (2009)

  5. Cheng, G., Ge, W., Qu, Y.: Falcons: searching and browsing entities on the semantic web. In: Proceedings of the 17th International Conference on World Wide Web, WWW’08, pp. 1101–1102. ACM, New York (2008). doi:10.1145/1367497.1367676

  6. Cheng, G., Qu, Y.: Term dependence on the semantic web. In: Proceedings of the 7th International Conference on the Semantic Web, ISWC’08, pp. 665–680. Springer, Berlin (2008). doi:10.1007/978-3-540-88564-1_42

  7. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley-Interscience, New York (1991)

  8. Ding, L., Finin, T.: Characterizing the semantic web on the web. In: The Semantic Web-ISWC 2006, 5th International Semantic Web Conference, ISWC 2006, Athens, 5–9 Nov 2006. Proceedings, Lecture Notes in Computer Science, vol. 4273, pp. 242–257. Springer, New York (2006)

  9. Ding, L., Finin, T.W., Joshi, A., Pan, R., Cost, R.S., Peng, Y., Reddivari, P., Doshi, V., Sachs, J.: Swoogle: a search and metadata engine for the semantic web. In: CIKM ACM (2004)

  10. Ding, L., Shinavier, J., Shangguan, Z., McGuinness, D.L.: SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data. In: The Semantic Web-ISWC 2010: 9th International Semantic Web Conference, ISWC 2010, Shanghai, 7–11 Nov 2010. Revised Selected Papers, Part I, Lecture Notes in Computer Science, vol. 6496, pp. 145–160. Springer, New York (2010)

  11. Gottron, T., Knauf, M., Scheglmann, S., Scherp, A.: A systematic investigation of explicit and implicit schema information on the linked open data cloud. In: ESWC’13: Proceedings of the 10th Extended Semantic Web Conference (2013) (to appear)

  12. Gottron, T., Pickhardt, R.: A detailed analysis of the quality of stream-based schema construction on linked open data. In: CSWS’12: Proceedings of the Chinese Semantic Web Symposium (2012) (to appear)

  13. Gottron, T., Scherp, A., Krayer, B., Peters, A.: Get the google feeling: supporting users in finding relevant sources of linked open data at web-scale. In: Semantic Web Challenge, Submission to the Billion Triple Track (2012)

  14. Gottron, T., Scherp, A., Krayer, B., Peters, A.: LODatio: using a schema-based index to support users in finding relevant sources of linked data. In: K-CAP’13: Proceedings of the Conference on Knowledge Capture (2013)

  15. Harth, A., Hose, K., Karnstedt, M., Polleres, A., Sattler, K.U., Umbrich, J.: Data summaries for on-demand queries over linked data. In: WWW, pp. 411–420. ACM (2010)

  16. Hausenblas, M., Halb, W., Raimond, Y., Feigenbaum, L., Ayers, D.: Scovo: Using statistics on the web of data. In: The semantic web: research and applications, 6th European Semantic Web Conference, ESWC 2009, Heraklion, Crete, 31 May–4 June 2009, Proceedings, Lecture Notes in Computer Science, vol. 5554, pp. 708–722. Springer, New York (2009)

  17. Heath, T., Bizer, C.: Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan & Claypool, San Rafael (2011)

  18. Hinkle, D., Wiersma, W., Jurs, S.: Applied Statistics for the Behavioral Sciences. Houghton Mifflin, Boston (2003)

  19. Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., Decker, S.: An empirical survey of linked data conformance. Web Semantics: Science, Services and Agents on the World Wide Web 14, 14–44 (2012). doi:10.1016/j.websem.2012.02.001

  20. Konrath, M., Gottron, T., Scherp, A.: Schemex—web-scale indexed schema extraction of linked open data. In: Semantic Web Challenge, Submission to the Billion Triple Track (2011)

  21. Konrath, M., Gottron, T., Staab, S., Scherp, A.: Schemex—efficient construction of a data catalogue by stream-based indexing of linked data. Web Semantics: Science, Services and Agents on the World Wide Web 16(5), 52–58 (2012). doi:10.1016/j.websem.2012.06.002. http://www.sciencedirect.com/science/article/pii/S1570826812000716. The Semantic Web Challenge 2011

  22. Lorey, J., Abedjan, Z., Naumann, F., Böhm, C.: RDF ontology (re-)engineering through large-scale data mining. In: Semantic Web Challenge (2011)

  23. Luo, X., Shinavier, J.: Entropy-based metrics for evaluating schema reuse. In: Gómez-Pérez, A., Yu, Y., Ding, Y. (eds.) The Semantic Web, Lecture Notes in Computer Science, vol. 5926, pp. 321–331. Springer, Berlin (2009). doi:10.1007/978-3-642-10871-6_22

  24. Maduko, A., Anyanwu, K., Sheth, A., Schliekelman, P.: Graph summaries for subgraph frequency estimation. In: Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) The Semantic Web: Research and Applications, Lecture Notes in Computer Science, vol. 5021, pp. 508–523. Springer, Berlin (2008). doi:10.1007/978-3-540-68234-9_38

  25. Neumann, T., Moerkotte, G.: Characteristic sets: accurate cardinality estimation for RDF queries with multiple joins. In: Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, 11–16 Apr 2011, Hannover, pp. 984–994 (2011)

  26. Schaible, J., Gottron, T., Scheglmann, S., Scherp, A.: LOVER: support for modeling data using linked open vocabularies. In: LWDM’13: 3rd International Workshop on Linked Web Data Management (2013) (to appear)

  27. Scheglmann, S., Gröner, G., Staab, S., Lämmel, R.: Incompleteness-aware programming with RDF data. In: Viegas, E., Breitman, K., Bishop, J. (eds.) Proceedings of the 2013 Workshop on Data Driven Functional Programming, DDFP 2013, Rome, 22 Jan 2013, pp. 11–14. ACM (2013)

  28. Shannon, C.: A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423 and 623–656 (1948)

  29. Wang, T.D., Parsia, B., Hendler, J.A.: A survey of the web ontology landscape. In: The Semantic Web—ISWC 2006, 5th International Semantic Web Conference, ISWC 2006, Athens, 5–9 Nov 2006. Proceedings, Lecture Notes in Computer Science, vol. 4273, pp. 682–694. Springer, New York (2006)

  30. Yan, X., Han, J.: gSpan: Graph-based substructure pattern mining. In: ICDM, pp. 721–724. IEEE Computer Society, Los Alamitos (2002)

  31. Yao, Y.: Information-theoretic measures for knowledge discovery and data mining. In: Karmeshu, (ed.) Entropy Measures, Maximum Entropy Principle and Emerging Applications, Studies in Fuzziness and Soft Computing, vol. 119, pp. 115–136. Springer, Berlin (2003). doi:10.1007/978-3-540-36212-8_6

Acknowledgments

The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013), REVEAL (grant agreement number 610928). Parts of the computations and experiments were conducted on resources provided by the HPI Future SOC Lab.

Author information

Corresponding author

Correspondence to Ansgar Scherp.

Additional information

Communicated by Haixun Wang and Jeffrey Xu Yu.

About this article

Cite this article

Gottron, T., Knauf, M. & Scherp, A. Analysis of schema structures in the Linked Open Data graph based on unique subject URIs, pay-level domains, and vocabulary usage. Distrib Parallel Databases 33, 515–553 (2015). https://doi.org/10.1007/s10619-014-7143-0

  • DOI: https://doi.org/10.1007/s10619-014-7143-0
