Mining Cardinalities from Knowledge Bases

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10438)

Abstract

Cardinality is an important structural aspect of data that has not received enough attention in the context of RDF knowledge bases (KBs). Information about cardinalities can help data users and knowledge engineers when writing queries and when reusing or engineering KBs. Such cardinalities can be declared using OWL and RDF constraint languages as constraints on the usage of properties over instance data. However, their declaration is optional, and consistency with the instance data is not ensured. In this paper, we address the problem of mining cardinality bounds for properties to discover structural characteristics of KBs, and we use these bounds to assess completeness. Because KBs are incomplete and error-prone, we apply statistical methods for filtering property usage and for finding accurate and robust patterns. Accuracy of the cardinality patterns is ensured by properly handling equality axioms (owl:sameAs), and robustness by filtering outliers. We report an implementation of our algorithm with two variants, one using SPARQL 1.1 and one using Apache Spark, and their evaluation on real-world and synthetic data.
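
To make the idea in the abstract concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a KB given as in-memory (subject, property, object) tuples, merges subjects linked by owl:sameAs with a union-find structure, counts distinct values per merged subject, and derives a lower and an upper bound. The outlier step is a simple frequency threshold chosen for illustration in place of the statistical tests used in the paper; the names `cardinality_bounds`, `canonicalise`, and `min_support` are hypothetical.

```python
from collections import Counter, defaultdict

SAME_AS = "owl:sameAs"  # assumed string label for the equality property


def find(parent, x):
    """Union-find lookup with path halving; unseen nodes map to themselves."""
    while parent.setdefault(x, x) != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def canonicalise(triples):
    """Rewrite every resource to one representative of its sameAs clique."""
    parent = {}
    for s, p, o in triples:
        if p == SAME_AS:
            parent[find(parent, s)] = find(parent, o)
    return [(find(parent, s), p, find(parent, o))
            for s, p, o in triples if p != SAME_AS]


def cardinality_bounds(triples, prop, min_support=0.05):
    """Return (lower, upper) bounds for prop, or None if no usage survives.

    Only subjects that use prop at least once are counted, so the lower
    bound is never 0 in this sketch. Cardinality values observed for less
    than min_support of those subjects are discarded as outliers
    (illustrative rule, not the paper's statistical filter).
    """
    values = defaultdict(set)
    for s, p, o in canonicalise(triples):
        if p == prop:
            values[s].add(o)  # a set, so duplicated values count once
    freq = Counter(len(v) for v in values.values())
    total = sum(freq.values())
    kept = [c for c, n in freq.items() if n / total >= min_support]
    return (min(kept), max(kept)) if kept else None
```

With a non-zero `min_support`, a single subject carrying an unusually large number of values does not inflate the upper bound, and two subjects declared equal contribute one merged count rather than two partial ones.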


Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Fujitsu Ireland Limited, Dublin, Ireland
  2. Insight Centre for Data Analytics, National University of Ireland, Galway, Ireland
