A Clustering-Based Framework for Incrementally Repairing Entity Resolution

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9652)

Abstract

Although entity resolution (ER) is known to be an important problem with widespread applications in many areas, including e-commerce, healthcare, social science, and crime and fraud detection, one aspect that has largely been neglected is monitoring the quality of entity resolution and repairing erroneous matching decisions over time. In this paper we develop an efficient method for incrementally repairing ER, i.e., fixing detected erroneous matches and non-matches. Our method is based on an efficient clustering algorithm that eliminates inconsistencies among matching decisions, and an efficient provenance indexing data structure that allows us to trace the evidence of clustering in support of ER repairing. We have evaluated our method over real-world databases, and our experimental results show that the quality of entity resolution can be significantly improved through repairing over time.
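
The two ingredients named in the abstract, consistent clustering of matching decisions and a provenance index that records the evidence behind each cluster, can be illustrated with a small sketch. This is not the algorithm from the paper: the greedy merge order, the function names `cluster` and `repair_non_match`, and the representation of evidence as `(a, b, score)` tuples are all assumptions made for illustration only.

```python
from collections import defaultdict


def cluster(records, matches, non_matches):
    """Greedy consistent clustering (illustrative assumption, not the paper's method).

    matches: list of (a, b, score) match decisions.
    non_matches: set of frozenset({a, b}) pairs that must stay apart.
    Returns (cluster_of, provenance), where provenance maps a cluster id
    to the match decisions that were used as evidence for that cluster.
    """
    cluster_of = {r: r for r in records}   # record -> cluster id
    members = {r: {r} for r in records}    # cluster id -> set of records
    provenance = defaultdict(list)         # cluster id -> evidence list

    # Process the strongest match decisions first.
    for a, b, score in sorted(matches, key=lambda m: -m[2]):
        ca, cb = cluster_of[a], cluster_of[b]
        if ca == cb:
            provenance[ca].append((a, b, score))
            continue
        # Skip a merge that would put a known non-match pair in one cluster.
        if any(frozenset((x, y)) in non_matches
               for x in members[ca] for y in members[cb]):
            continue
        # Merge cluster cb into ca and carry its evidence along.
        for r in members[cb]:
            cluster_of[r] = ca
        members[ca] |= members.pop(cb)
        provenance[ca].extend(provenance.pop(cb, []))
        provenance[ca].append((a, b, score))
    return cluster_of, dict(provenance)


def repair_non_match(a, b, cluster_of, provenance, non_matches):
    """Repair a detected erroneous match between records a and b.

    Only the cluster that contains both records is re-clustered, using
    the evidence stored for it in the provenance index.
    """
    cid = cluster_of[a]
    if cluster_of[b] != cid:
        return cluster_of, provenance      # already in different clusters
    affected = [r for r, c in cluster_of.items() if c == cid]
    evidence = provenance.get(cid, [])
    new_non_matches = set(non_matches) | {frozenset((a, b))}
    local_of, local_prov = cluster(affected, evidence, new_non_matches)

    updated_of = dict(cluster_of)
    updated_of.update(local_of)
    updated_prov = {k: v for k, v in provenance.items() if k != cid}
    updated_prov.update(local_prov)
    return updated_of, updated_prov
```

In this sketch a repair touches only the records of the affected cluster and replays only the evidence recorded for it, which is the sense in which the repair is incremental. A real system would also need to handle repairs of erroneous non-matches and to bound the cost of the pairwise consistency check.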

Keywords

Data matching; Record linkage; Deduplication; Data provenance; Data repairing; Consistent clustering

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Research School of Computer Science, The Australian National University, Canberra, Australia
