SAKey: Scalable Almost Key Discovery in RDF Data

  • Danai Symeonidou
  • Vincent Armant
  • Nathalie Pernelle
  • Fatiha Saïs
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8796)

Abstract

Exploiting identity links among RDF resources allows applications to integrate data efficiently. Keys can be very useful for discovering these identity links: a set of properties is considered a key when its values uniquely identify resources. However, such keys are usually not available. Approaches that attempt to discover keys automatically can easily be overwhelmed by the size of the data, and they require clean data. We present SAKey, an approach that discovers keys in RDF data efficiently. To prune the search space, SAKey exploits characteristics of the data that are detected dynamically during the process. Furthermore, our approach can discover keys in datasets that contain erroneous data or duplicates; such keys are called almost keys. The approach has been evaluated on different synthetic and real datasets. The results show both the relevance of almost keys and the efficiency of discovering them.
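
To make the notions of key and almost key concrete, here is a minimal Python sketch. It is not the SAKey algorithm: it checks a single candidate property set by brute force and does no search-space pruning. The toy triples, the function names, and the parameter `n` are all hypothetical. The sketch also rests on two assumptions that the abstract does not state: two resources "share" a property when they have at least one value in common for it, and an almost key is a property set with at most `n` exception resources.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: (subject, property, value) triples.
TRIPLES = [
    ("p1", "name", "Ann"), ("p1", "city", "Paris"),
    ("p2", "name", "Ann"), ("p2", "city", "Cork"),
    ("p3", "name", "Bob"), ("p3", "city", "Paris"),
]

def values_by_resource(triples):
    """Group triples as resource -> property -> set of values."""
    desc = defaultdict(lambda: defaultdict(set))
    for subject, prop, value in triples:
        desc[subject][prop].add(value)
    return desc

def exceptions(triples, props):
    """Return the resources that collide with another resource on props.

    Assumption: two resources collide when, for every property in props,
    they have at least one value in common.
    """
    desc = values_by_resource(triples)
    exc = set()
    for a, b in combinations(sorted(desc), 2):
        if all(desc[a][p] & desc[b][p] for p in props):
            exc.update((a, b))
    return exc

def is_almost_key(triples, props, n=0):
    """True if props has at most n exception resources (n=0: exact key)."""
    return len(exceptions(triples, props)) <= n

if __name__ == "__main__":
    print(is_almost_key(TRIPLES, ["name"]))          # name alone
    print(is_almost_key(TRIPLES, ["name", "city"]))  # name and city together
    print(is_almost_key(TRIPLES, ["name"], n=2))     # tolerate two exceptions
```

In this toy data, `name` alone is not a key because `p1` and `p2` are both named Ann, whereas `name` combined with `city` identifies every resource. Raising `n` lets `name` pass despite the collision, which is the kind of tolerance for duplicates or errors that the abstract attributes to almost keys.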

Keywords

Keys, Identity Links, Data Linking, RDF, OWL2

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Danai Symeonidou (1)
  • Vincent Armant (2)
  • Nathalie Pernelle (1)
  • Fatiha Saïs (1)

  1. Laboratoire de Recherche en Informatique, University Paris Sud, France
  2. Insight Center for Data Analytics, University College Cork, Ireland