Analysis of schema structures in the Linked Open Data graph based on unique subject URIs, pay-level domains, and vocabulary usage

Gottron, Thomas; Knauf, Malte; Scherp, Ansgar

doi:10.1007/s10619-014-7143-0

Published: 11 February 2014

Distributed and Parallel Databases, volume 33, pages 515–553 (2015)

Abstract

The Linked Open Data (LOD) graph is a web-scale distributed knowledge graph that interlinks information about entities across various domains. A core concept is the absence of a predefined schema, which allows data from all kinds of domains to be modelled flexibly. However, Linked Data does exhibit schema information in two ways: explicitly, through RDF types attached to the entities, and implicitly, through the domain-specific properties used to describe them. In this paper, we present and apply different techniques for investigating the schematic information encoded in the LOD graph at different levels of granularity. We investigate different information-theoretic properties of so-called Unique Subject URIs (USUs) and measure the correlation between the properties and types that can be observed for USUs on a large-scale semantic graph data set. Our analysis provides insights into the information encoded in the different schema characteristics. Two major findings are that implicit schema information is far more discriminative than explicit schema information and that applications relying on either types or properties alone will capture only between 63.5 and 88.1% of the schema information contained in the data. As the level of discrimination depends on how data providers model and publish their data, we conducted, in a second step, an investigation based on pay-level domains (PLDs) as well as the semantic level of vocabularies. Overall, we observe that most data providers combine up to 10 vocabularies to model their data and that every fifth PLD uses a highly structured schema.
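
As a rough illustration of the kind of measurement the abstract describes, the following Python sketch groups triples by subject URI, derives a type set and a property set for each subject, and computes Shannon entropies and the mutual information over their empirical distributions. It is a minimal sketch under assumptions that are not taken from the article: the function names are hypothetical, rdf:type is excluded from the property sets, and the measures shown may differ from the article's exact definitions.

```python
from collections import Counter, defaultdict
from math import log2

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def schema_sets(triples):
    """Map each subject URI to a pair (type set, property set).

    Assumption: rdf:type statements feed the type set only and are
    not counted as properties.
    """
    types = defaultdict(set)
    props = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)
        else:
            props[s].add(p)
    subjects = set(types) | set(props)
    return {s: (frozenset(types[s]), frozenset(props[s])) for s in subjects}


def entropy(counts):
    """Shannon entropy (in bits) of an empirical distribution of counts."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def schema_information(triples):
    """Entropy of type sets, property sets, their joint, and mutual information."""
    sets = schema_sets(triples)
    h_t = entropy(Counter(t for t, _ in sets.values()))
    h_r = entropy(Counter(r for _, r in sets.values()))
    h_tr = entropy(Counter(sets.values()))
    return {
        "H(T)": h_t,
        "H(R)": h_r,
        "H(T,R)": h_tr,
        "I(T;R)": h_t + h_r - h_tr,
    }


# Toy data with hypothetical example.org URIs (not from the BTC data set).
EX = "http://example.org/"
EXAMPLE = [
    (EX + "a", RDF_TYPE, EX + "Person"),
    (EX + "a", EX + "name", "A"),
    (EX + "a", EX + "knows", EX + "b"),
    (EX + "b", RDF_TYPE, EX + "Person"),
    (EX + "b", EX + "name", "B"),
    (EX + "c", RDF_TYPE, EX + "Person"),
    (EX + "c", EX + "name", "C"),
    (EX + "c", EX + "knows", EX + "a"),
    (EX + "d", RDF_TYPE, EX + "Document"),
    (EX + "d", EX + "title", "D"),
]

if __name__ == "__main__":
    for name, value in schema_information(EXAMPLE).items():
        print(f"{name} = {value:.4f}")
```

In this toy data every property set occurs with exactly one type set, so the property sets determine the type sets and I(T;R) equals H(T). The reverse does not hold: the type set alone cannot distinguish persons described with and without the hypothetical `knows` property.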

Notes

http://www.w3.org/DesignIssues/LinkedData.html, accessed: 23 March, 2013.
http://www.w3.org/TR/rdf-syntax-grammar/, accessed: 23 March, 2013.
http://www.lod-cloud.net/, accessed: 23 March, 2013.
Taken from the famous Friend-of-a-Friend (FOAF) vocabulary for describing people and their relations. See: http://www.foaf-project.org/, accessed: 23 March, 2013.
http://challenge.semanticweb.org/, accessed: 23 March, 2013.
Please note that we use the letter \(r\) for sets of properties (inspired by the term relation), as \(p\) will be used to denote probabilities.
Data sources are, e.g., static RDF documents and SPARQL endpoints [17].
BTC 2012 data set: http://km.aifb.kit.edu/projects/btc-2012/, accessed: 25 March, 2013.
Please note that the efficient stream-based approach can also cause an increase in the number of type sets and property sets. This is because a single missed type can cause the deduction of a new type set which does not actually occur in the lossless gold standard.
This might not entirely be the case in semi-automated extraction of Linked Data (e.g., DBpedia) or when using crowdsourcing for the creation of Linked Data (e.g., Freebase).
http://www.w3.org/TR/2013/CR-vocab-data-cube-20130625/, accessed: 11 September 2013.
https://github.com/cygri/make-void, accessed: 21 March, 2013.
http://rdfstats.sourceforge.net/, accessed: 21 March, 2013.
http://www.webdatacommons.org/, accessed: 21 March, 2013.
http://webdatacommons.org/2012-08/stats/how_to_get_the_data.html#toc2, accessed: 21 March, 2013.
http://webdatacommons.org/, accessed: 21 March, 2013.
http://lod-cloud.net/state/, accessed: 21 March, 2013.

References

Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th International Conference on Very Large Data Bases, VLDB’94, pp. 487–499. Morgan Kaufmann Publishers Inc., San Francisco (1994). http://dl.acm.org/citation.cfm?id=645920.672836
Alexander, K., Cyganiak, R., Hausenblas, M., Zhao, J.: Describing Linked Datasets with the Void Vocabulary. http://www.w3.org/TR/void/. Accessed 9 Mar 2013
Auer, S., Demter, J., Martin, M., Lehmann, J.: Lodstats—an extensible framework for high-performance dataset analytics. In: Teije, A., Völker, J., Handschuh, S., Stuckenschmidt, H., d’Acquin, M., Nikolov, A., Aussenac-Gilles, N., Hernandez, N. (eds.) Knowledge Engineering and Knowledge Management, Lecture Notes in Computer Science, vol. 7603, pp. 353–362. Springer, Berlin (2012). doi:10.1007/978-3-642-33876-2_31
Bizer, C.: The emerging web of linked data. IEEE Intell. Syst. 24(5), 87–92 (2009)
Cheng, G., Ge, W., Qu, Y.: Falcons: searching and browsing entities on the semantic web. In: Proceedings of the 17th International Conference on World Wide Web, WWW’08, pp. 1101–1102. ACM, New York (2008). doi:10.1145/1367497.1367676
Cheng, G., Qu, Y.: Term dependence on the semantic web. In: Proceedings of the 7th International Conference on the Semantic Web, ISWC’08, pp. 665–680. Springer, Berlin (2008). doi:10.1007/978-3-540-88564-1_42
Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley-Interscience, New York (1991)
Ding, L., Finin, T.: Characterizing the semantic web on the web. In: The Semantic Web-ISWC 2006, 5th International Semantic Web Conference, ISWC 2006, Athens, 5–9 Nov 2006. Proceedings, Lecture Notes in Computer Science, vol. 4273, pp. 242–257. Springer, New York (2006)
Ding, L., Finin, T.W., Joshi, A., Pan, R., Cost, R.S., Peng, Y., Reddivari, P., Doshi, V., Sachs, J.: Swoogle: a search and metadata engine for the semantic web. In: CIKM ACM (2004)
Ding, L., Shinavier, J., Shangguan, Z., McGuinness, D.L.: Sameas networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data. In: The Semantic Web—ISWC 2010: 9th International Semantic Web Conference, ISWC 2010, Shanghai, 7–11 Nov 2010. Revised Selected Papers, Part I, Lecture Notes in Computer Science, vol. 6496, pp. 145–160. Springer, New York (2010)
Gottron, T., Knauf, M., Scheglmann, S., Scherp, A.: A systematic investigation of explicit and implicit schema information on the linked open data cloud. In: ESWC’13: Proceedings of the 10th Extended Semantic Web Conference (2013) (to appear)
Gottron, T., Pickhardt, R.: A detailed analysis of the quality of stream-based schema construction on linked open data. In: CSWS’12: Proceedings of the Chinese Semantic Web Symposium (2012) (to appear)
Gottron, T., Scherp, A., Krayer, B., Peters, A.: Get the google feeling: supporting users in finding relevant sources of linked open data at web-scale. In: Semantic Web Challenge, Submission to the Billion Triple Track (2012)
Gottron, T., Scherp, A., Krayer, B., Peters, A.: LODatio: using a schema-based index to support users in finding relevant sources of linked data. In: K-CAP’13: Proceedings of the Conference on Knowledge Capture (2013)
Harth, A., Hose, K., Karnstedt, M., Polleres, A., Sattler, K.U., Umbrich, J.: Data summaries for on-demand queries over linked data. In: WWW, pp. 411–420. ACM (2010)
Hausenblas, M., Halb, W., Raimond, Y., Feigenbaum, L., Ayers, D.: Scovo: Using statistics on the web of data. In: The semantic web: research and applications, 6th European Semantic Web Conference, ESWC 2009, Heraklion, Crete, 31 May–4 June 2009, Proceedings, Lecture Notes in Computer Science, vol. 5554, pp. 708–722. Springer, New York (2009)
Heath, T., Bizer, C.: Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan & Claypool, San Rafael (2011)
Hinkle, D., Wiersma, W., Jurs, S.: Applied Statistics for the Behavioral Sciences. Houghton Mifflin, Boston (2003)
Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., Decker, S.: An empirical survey of linked data conformance. Web Semantics: Science, Services and Agents on the World Wide Web 14, 14–44 (2012). doi:10.1016/j.websem.2012.02.001
Konrath, M., Gottron, T., Scherp, A.: Schemex—web-scale indexed schema extraction of linked open data. In: Semantic Web Challenge, Submission to the Billion Triple Track (2011)
Konrath, M., Gottron, T., Staab, S., Scherp, A.: Schemex—efficient construction of a data catalogue by stream-based indexing of linked data. Web Semantics: Science, Services and Agents on the World Wide Web 16(5), 52–58 (2012). doi:10.1016/j.websem.2012.06.002. http://www.sciencedirect.com/science/article/pii/S1570826812000716. The Semantic Web Challenge 2011
Lorey, J., Abedjan, Z., Naumann, F., Böhm, C.: Rdf ontology (re-) engineering through large-scale data mining. In: Semantic Web Challenge (2011)
Luo, X., Shinavier, J.: Entropy-based metrics for evaluating schema reuse. In: Gómez-Pérez, A., Yu, Y., Ding, Y. (eds.) The Semantic Web, Lecture Notes in Computer Science, vol. 5926, pp. 321–331. Springer, Berlin (2009). doi:10.1007/978-3-642-10871-6_22
Maduko, A., Anyanwu, K., Sheth, A., Schliekelman, P.: Graph summaries for subgraph frequency estimation. In: Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) The Semantic Web: Research and Applications, Lecture Notes in Computer Science, vol. 5021, pp. 508–523. Springer, Berlin (2008). doi:10.1007/978-3-540-68234-9_38
Neumann, T., Moerkotte, G.: Characteristic sets: accurate cardinality estimation for RDF queries with multiple joins. In: Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, 11–16 Apr 2011, Hannover, pp. 984–994 (2011)
Schaible, J., Gottron, T., Scheglmann, S., Scherp, A.: LOVER: support for modeling data using linked open vocabularies. In: LWDM’13: 3rd International Workshop on Linked Web Data Management (2013) (to appear)
Scheglmann, S., Gröner, G., Staab, S., Lämmel, R.: Incompleteness-aware programming with RDF data. In: Viegas, E., Breitman, K., Bishop, J. (eds.) Proceedings of the 2013 Workshop on Data Driven Functional Programming, DDFP 2013, Rome, 22 Jan 2013, pp. 11–14. ACM (2013)
Shannon, C.: A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423 and 623–656 (1948)
Wang, T.D., Parsia, B., Hendler, J.A.: A survey of the web ontology landscape. In: The Semantic Web—ISWC 2006, 5th International Semantic Web Conference, ISWC 2006, Athens, 5–9 Nov 2006. Proceedings, Lecture Notes in Computer Science, vol. 4273, pp. 682–694. Springer, New York (2006)
Yan, X., Han, J.: gspan: Graph-based substructure pattern mining. In: ICDM, pp. 721–724. IEEE Computer Society, Los Alamitos (2002)
Yao, Y.: Information-theoretic measures for knowledge discovery and data mining. In: Karmeshu, (ed.) Entropy Measures, Maximum Entropy Principle and Emerging Applications, Studies in Fuzziness and Soft Computing, vol. 119, pp. 115–136. Springer, Berlin (2003). doi:10.1007/978-3-540-36212-8_6

Acknowledgments

The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013), REVEAL (grant agreement number 610928). Parts of the computations and experiments were conducted on resources provided by the HPI Future SOC Lab.

Author information

Authors and Affiliations

Institute for Web Science and Technologies (WeST), University of Koblenz-Landau, 56070 Koblenz, Germany
Thomas Gottron & Malte Knauf
Kiel University and Leibniz Information Center for Economics, Kiel, Germany
Ansgar Scherp

Corresponding author

Correspondence to Ansgar Scherp.

Additional information

Communicated by Haixun Wang and Jeffrey Xu Yu.

About this article

Cite this article

Gottron, T., Knauf, M. & Scherp, A. Analysis of schema structures in the Linked Open Data graph based on unique subject URIs, pay-level domains, and vocabulary usage. Distrib Parallel Databases 33, 515–553 (2015). https://doi.org/10.1007/s10619-014-7143-0

Published: 11 February 2014
Issue Date: December 2015
DOI: https://doi.org/10.1007/s10619-014-7143-0
