Provenance-Aware Entity Resolution: Leveraging Provenance to Improve Quality

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9049)

Abstract

Entity resolution (ER), the process of identifying records that refer to the same real-world entity, is pervasive across many application areas. Nevertheless, ER results are rarely completely accurate. In this paper, we investigate a provenance-aware framework for ER. We first propose an indexing structure that can be built efficiently to store provenance in support of an ER process. We then develop a generic repairing strategy, called coordinate-split-merge (CSM), to control the interaction between repairs driven by must-link and cannot-link constraints. Our experimental results show that the proposed indexing structure captures the provenance of ER efficiently in both time and space, and that it scales linearly with the number of matches. Our repairing algorithms can significantly reduce the human effort required to leverage the provenance of ER for identifying erroneous matches.
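
To make the role of provenance and of the two constraint types concrete, the following is a minimal Python sketch. It is not the indexing structure or the CSM algorithm proposed in the paper, neither of which is specified in this abstract. It assumes, purely for illustration, that provenance is kept as a list of (record, record, rule) matches, and the names `build_clusters`, `explain`, and the rule labels are hypothetical. Entities are taken to be the connected components of the match graph; a cannot-link constraint is violated when two records share a component, and the recorded matches then give the chain a reviewer would inspect for an erroneous match.

```python
from collections import defaultdict, deque


def build_clusters(matches):
    """Map each record to a cluster id (connected component of the match graph)."""
    graph = defaultdict(set)
    for a, b, _rule in matches:
        graph[a].add(b)
        graph[b].add(a)
    cluster_of = {}
    for start in graph:
        if start in cluster_of:
            continue
        cluster_of[start] = start  # use the first record seen as the cluster id
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in graph[node]:
                if nxt not in cluster_of:
                    cluster_of[nxt] = start
                    queue.append(nxt)
    return cluster_of


def explain(matches, a, b):
    """Return a shortest chain of recorded matches linking a to b, or None."""
    graph = defaultdict(list)
    for x, y, rule in matches:
        graph[x].append((y, rule))
        graph[y].append((x, rule))
    parent = {a: None}
    queue = deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            break
        for nxt, rule in graph[node]:
            if nxt not in parent:
                parent[nxt] = (node, rule)
                queue.append(nxt)
    if b not in parent:
        return None
    path = []
    node = b
    while parent[node] is not None:
        prev, rule = parent[node]
        path.append((prev, node, rule))
        node = prev
    return list(reversed(path))


# Hypothetical provenance: each match records the rule that produced it.
matches = [
    ("r1", "r2", "name_sim"),
    ("r2", "r3", "addr_sim"),
    ("r4", "r5", "name_sim"),
]
cluster_of = build_clusters(matches)

# Cannot-link (r1, r3) is violated: both records ended up in one cluster.
violated = cluster_of["r1"] == cluster_of["r3"]
chain = explain(matches, "r1", "r3")  # matches to review for a possible split

# Must-link (r3, r4) is violated: the records are in different clusters.
needs_merge = cluster_of["r3"] != cluster_of["r4"]
```

In this toy setting, repairing a cannot-link violation would mean removing one match on `chain` (a split), while repairing a must-link violation would mean adding a match (a merge). The two kinds of repair can interfere with each other, which is the interaction the abstract says CSM is designed to control.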

Keywords

Entity resolution, Data matching, Record linkage, Deduplication, Data provenance, Repair, Indexing structure

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Research School of Computer Science, Australian National University, Canberra, Australia
  2. Software Competence Center Hagenberg and Johannes-Kepler-University Linz, Linz, Austria
  3. Alcatel-Lucent Beijing, Beijing, China
