Distributed and Parallel Databases, Volume 33, Issue 4, pp 515–553

Analysis of schema structures in the Linked Open Data graph based on unique subject URIs, pay-level domains, and vocabulary usage

Article

Abstract

The Linked Open Data (LOD) graph is a web-scale, distributed knowledge graph that interlinks information about entities across various domains. A core concept is the absence of a pre-defined schema, which allows data from all kinds of domains to be modelled flexibly. However, Linked Data does exhibit schema information in two ways: explicitly, through RDF types attached to entities, and implicitly, through the domain-specific properties used to describe them. In this paper, we present and apply different techniques for investigating the schema information encoded in the LOD graph at different levels of granularity. We investigate information-theoretic properties of so-called Unique Subject URIs (USUs) and measure the correlation between the properties and types that can be observed for USUs on a large-scale semantic graph data set. Our analysis provides insights into the information encoded in the different schema characteristics. Two major findings are that implicit schema information is far more discriminative, and that applications relying on either types or properties alone will capture only between 63.5 and 88.1 % of the schema information contained in the data. Because the level of discrimination depends on how data providers model and publish their data, we conducted, in a second step, an investigation based on pay-level domains (PLDs) as well as the semantic level of vocabularies. Overall, we observe that most data providers combine up to 10 vocabularies to model their data and that every fifth PLD uses a highly structured schema.
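To make the distinction between explicit and implicit schema information more concrete, the following is a minimal sketch of how entropy-based measures over type sets and property sets could be computed. It is an illustration under stated assumptions, not the authors' implementation: each distinct subject URI is treated as one USU, the function names (`schema_per_subject`, `entropy`, `schema_entropies`) and the toy triples are hypothetical, and the exact measures used in the article may differ from the standard Shannon quantities shown here.

```python
from collections import Counter, defaultdict
from math import log2

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def schema_per_subject(triples):
    """Map each subject URI to its (type set, property set) pair.

    Assumption: every distinct subject URI counts as one USU.
    """
    types, props = defaultdict(set), defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)   # explicit schema information
        else:
            props[s].add(p)   # implicit schema information
    subjects = set(types) | set(props)
    return {
        s: (frozenset(types.get(s, ())), frozenset(props.get(s, ())))
        for s in subjects
    }


def entropy(counts):
    """Shannon entropy in bits of the distribution given by a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def schema_entropies(triples):
    """Entropy of type sets, property sets, their joint, and mutual information."""
    schema = schema_per_subject(triples)
    h_t = entropy(Counter(t for t, _ in schema.values()))
    h_p = entropy(Counter(p for _, p in schema.values()))
    h_tp = entropy(Counter(schema.values()))
    return {
        "H(T)": h_t,
        "H(P)": h_p,
        "H(T,P)": h_tp,
        "I(T;P)": h_t + h_p - h_tp,  # mutual information of types and properties
    }


# Hypothetical toy data: three persons and one document.
toy_triples = [
    ("ex:a", RDF_TYPE, "ex:Person"), ("ex:a", "ex:name", "A"),
    ("ex:b", RDF_TYPE, "ex:Person"), ("ex:b", "ex:name", "B"),
    ("ex:b", "ex:mbox", "b@example.org"),
    ("ex:c", RDF_TYPE, "ex:Person"), ("ex:c", "ex:name", "C"),
    ("ex:d", RDF_TYPE, "ex:Doc"), ("ex:d", "ex:title", "D"),
]

for name, value in schema_entropies(toy_triples).items():
    print(f"{name} = {value:.3f}")
```

In this toy data the property sets split the three persons into two groups, while the types cannot tell them apart, so the property-set entropy is higher than the type-set entropy. This mirrors, on a very small scale, the kind of comparison described in the abstract.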

Keywords

Linked Open Data, Schema analysis, Information Entropy

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. Institute for Web Science and Technologies (WeST), University of Koblenz-Landau, Koblenz, Germany
  2. Kiel University and Leibniz Information Center for Economics, Kiel, Germany
