Detecting Identical Entities in the Semantic Web Data

  • Michal Holub
  • Ondrej Proksa
  • Mária Bieliková
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8939)


The large number of entities published by various sources inevitably introduces inaccuracies, mainly duplicated information, which can be found even within a single dataset. In this paper we propose a method for automatically discovering identity relationships between two entities (also known as instance matching) in a dataset represented as a graph (e.g., in the Linked Data Cloud). The method can be used for removing duplicates from existing datasets, validating existing identity relationships between entities within a dataset, or connecting different datasets using the owl:sameAs relationship. It is based on the analysis of sub-graphs formed by entities, their properties, and the existing relationships between them. It can learn a common similarity threshold for a particular dataset, so it adapts to datasets with different properties. We evaluated our method in several experiments on data from the domains of public administration and digital libraries.
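
As an illustration only, the idea of comparing the sub-graphs around two entities and learning a dataset-specific threshold might be sketched as follows. The sketch assumes that each entity is described by a set of (property, value) pairs, that similarity is the Jaccard coefficient of those sets, and that the threshold is chosen to maximize F1 on a few labeled pairs. The function names and the toy data are hypothetical, and the paper's actual similarity measure and learning procedure may differ.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets of (property, value) pairs (assumed measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def learn_threshold(labeled_pairs, graph):
    """Pick the candidate threshold with the best F1 on labeled pairs.

    labeled_pairs: list of ((id1, id2), is_same) tuples (hypothetical training data).
    graph: dict mapping an entity id to its set of (property, value) pairs.
    """
    scores = [(jaccard(graph[x], graph[y]), same) for (x, y), same in labeled_pairs]
    best_t, best_f1 = 1.0, -1.0
    # Only similarity values that actually occur need to be tried as thresholds.
    for t in sorted({s for s, _ in scores}):
        tp = sum(1 for s, same in scores if s >= t and same)
        fp = sum(1 for s, same in scores if s >= t and not same)
        fn = sum(1 for s, same in scores if s < t and same)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t


def find_duplicates(graph, threshold):
    """Return pairs of entity ids whose similarity reaches the threshold."""
    return [
        (x, y)
        for x, y in combinations(sorted(graph), 2)
        if jaccard(graph[x], graph[y]) >= threshold
    ]


# Hypothetical toy dataset: three author entities with their direct neighborhood.
graph = {
    "a1": {("name", "J. Smith"), ("affiliation", "U1"), ("wrote", "p1")},
    "a2": {("name", "J. Smith"), ("affiliation", "U1"), ("wrote", "p2")},
    "a3": {("name", "K. Jones"), ("affiliation", "U1"), ("wrote", "p3")},
}
labeled = [(("a1", "a2"), True), (("a1", "a3"), False)]

threshold = learn_threshold(labeled, graph)
duplicates = find_duplicates(graph, threshold)
```

Learning the threshold from labeled pairs, rather than fixing it in advance, is one simple way to let the same matching code adapt to datasets whose entities share more or fewer properties.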


duplicates identity similarity relationship semantic web owl:sameAs Linked Data web of data 





Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  • Michal Holub (1)
  • Ondrej Proksa (1)
  • Mária Bieliková (1)
  1. Institute of Informatics and Software Engineering, Faculty of Informatics and Information Technologies, Slovak University of Technology, Bratislava, Slovakia
