Using Link Features for Entity Clustering in Knowledge Graphs

  • Alieh Saeedi
  • Eric Peukert
  • Erhard Rahm
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10843)


Knowledge graphs holistically integrate information about entities from multiple sources. A key step in the construction and maintenance of knowledge graphs is the clustering of equivalent entities from different sources. Previous approaches to such entity clustering suffer from several problems, e.g., the creation of overlapping clusters or the inclusion of several entities from the same source within a single cluster. We therefore propose a new entity clustering algorithm, CLIP, that can be applied both to create entity clusters and to repair entity clusters determined with another clustering scheme. In contrast to previous approaches, CLIP uses not only the similarity between entities for clustering but also further features of entity links, such as the so-called link strength. To achieve good scalability, we provide a parallel implementation of CLIP based on Apache Flink. Our evaluation on different datasets shows that the new approach can achieve substantially higher cluster quality than previous approaches.
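The two cluster defects named above, and the idea of a link feature beyond plain similarity, can be illustrated with a minimal Python sketch. It is not the CLIP algorithm and not the Flink implementation. The entity representation as a (source, id) pair, the function names, and in particular the strong/normal/weak reading of link strength are assumptions made for illustration only; the abstract does not define link strength.

```python
from collections import defaultdict

# Assumption: an entity is a (source, local_id) tuple, e.g. ("A", "1").


def find_cluster_defects(clusters):
    """Report the two defects mentioned in the abstract.

    Returns (overlapping, inconsistent):
      overlapping  - entities that occur in more than one cluster,
                     mapped to the indexes of those clusters
      inconsistent - indexes of clusters holding several entities
                     from the same source
    """
    membership = defaultdict(list)
    inconsistent = []
    for idx, cluster in enumerate(clusters):
        sources = [source for source, _ in cluster]
        if len(sources) != len(set(sources)):
            inconsistent.append(idx)
        for entity in cluster:
            membership[entity].append(idx)
    overlapping = {e: idxs for e, idxs in membership.items() if len(idxs) > 1}
    return overlapping, inconsistent


def classify_link_strength(links):
    """Label each link as 'strong', 'normal' or 'weak'.

    links maps (entity_a, entity_b) -> similarity.
    Assumed definition (hypothetical, for illustration): a link is
    strong if it is the most similar link of both endpoints toward the
    other endpoint's source, normal if this holds for one endpoint,
    and weak if it holds for neither.
    """
    best = {}  # (entity, other_source) -> highest similarity seen
    for (a, b), sim in links.items():
        for entity, other in ((a, b), (b, a)):
            key = (entity, other[0])
            best[key] = max(best.get(key, 0.0), sim)
    labels = {}
    for (a, b), sim in links.items():
        hits = (sim == best[(a, b[0])]) + (sim == best[(b, a[0])])
        labels[(a, b)] = ("weak", "normal", "strong")[hits]
    return labels
```

A clustering step could then, for example, prefer strong links when deciding which entities to group and use the defect check to find clusters that need repair. Whether CLIP proceeds in this way is not stated in the abstract.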



This work is partially funded by the German Federal Ministry of Education and Research under project ScaDS Dresden/Leipzig (BMBF 01IS14014B).



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. University of Leipzig and ScaDS Dresden/Leipzig, Leipzig, Germany
