Distributed Holistic Clustering on Linked Data

  • Markus Nentwig
  • Anika Groß
  • Maximilian Möller
  • Erhard Rahm
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10574)


Link discovery is an active field of research to support data integration in the Web of Data. Due to the huge size and number of available data sources, efficient and effective link discovery is a very challenging task. Common pairwise link discovery approaches do not scale to many sources with very large entity sets. We propose a distributed holistic approach to link many data sources based on a clustering of entities that represent the same real-world object. Our approach provides a compact and fused representation of entities, and can identify errors in existing links as well as many new links. We support distributed execution, show scalability for large real-world data sets and evaluate our methods with respect to effectiveness and efficiency for two domains.
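The idea of clustering entities that represent the same real-world object can be illustrated with a minimal sketch. It is not the authors' distributed implementation. It assumes that existing links are given as pairs of `(source, id)` entities, that clusters are formed as connected components over those links, and that each source is duplicate-free, so a cluster containing two entities from the same source hints at an erroneous link. The function names, the input format, and the example data are all hypothetical.

```python
from collections import defaultdict


def cluster_links(links):
    """Group linked entities into clusters (connected components) via union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = defaultdict(set)
    for entity in list(parent):
        clusters[find(entity)].add(entity)
    return list(clusters.values())


def suspicious_clusters(clusters):
    """Return clusters that hold more than one entity from the same source.

    Assumption (illustrative only): every source is duplicate-free, so such a
    cluster probably contains at least one wrong link.
    """
    flagged = []
    for cluster in clusters:
        sources = [source for source, _ in cluster]
        if len(sources) != len(set(sources)):
            flagged.append(cluster)
    return flagged


# Hypothetical links between entities of three sources.
example_links = [
    (("src_a", "1"), ("src_b", "7")),
    (("src_b", "7"), ("src_c", "3")),
    (("src_a", "2"), ("src_b", "8")),
    (("src_b", "8"), ("src_a", "5")),
]

clusters = cluster_links(example_links)
flagged = suspicious_clusters(clusters)
```

In this toy input, the links form two clusters of three entities each. The second cluster contains two entities from `src_a` and is therefore flagged for review. Every remaining pair of entities inside an unflagged cluster that is not yet directly linked is a candidate for a new link.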



This research was supported by the Deutsche Forschungsgemeinschaft (DFG) grant number RA 497/19-2.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Markus Nentwig
    • 1
  • Anika Groß
    • 1
  • Maximilian Möller
    • 1
  • Erhard Rahm
    • 1
  1. Database Group, University of Leipzig, Leipzig, Germany
