Comparative Evaluation of Distributed Clustering Schemes for Multi-source Entity Resolution

  • Alieh Saeedi
  • Eric Peukert
  • Erhard Rahm
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10509)


Entity resolution identifies semantically equivalent entities, e.g., records describing the same product or customer. It is especially challenging for big data applications, where large volumes of data from many sources have to be matched and integrated. Entity resolution for multiple data sources is best addressed by clustering schemes that group all matching entities within clusters. While there are many possible clustering schemes for entity resolution, their relative suitability and scalability are still unclear. We therefore implemented and comparatively evaluated distributed versions of six clustering schemes based on Apache Flink within a new entity resolution framework called Famer. Our evaluation on different real-life and synthetically generated datasets considers both match quality and scalability for different numbers of machines and data sizes.



This work was partly funded by the German Federal Ministry of Education and Research within the project Competence Center for Scalable Data Services and Solutions (ScaDS) Dresden/Leipzig (BMBF 01IS14014B). The evaluations were partly performed on the Galaxy infrastructure at Leipzig University.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Database Group, Department of Computer Science, University of Leipzig, Leipzig, Germany
