The VLDB Journal, 18:1261

Generic entity resolution with negative rules

  • Steven Euijong Whang
  • Omar Benjelloun
  • Hector Garcia-Molina
Regular Paper

Abstract

Entity resolution (ER), also known as deduplication or merge-purge, is the process of identifying records that refer to the same real-world entity and merging them together. In practice, ER results may contain “inconsistencies,” due either to mistakes by the match and merge function writers or to changes in the application semantics. To remove the inconsistencies, we introduce “negative rules” that disallow inconsistencies in the ER solution (ER-N). A consistent solution is then derived based on guidance from a domain expert. The inconsistencies can be resolved in several ways, leading to accurate solutions. We formalize ER-N, treating the match, merge, and negative rules as black boxes, which permits expressive and extensible ER-N solutions. We identify important properties for the rules that, if satisfied, enable less costly ER-N. We develop and evaluate two algorithms that find an ER-N solution based on guidance from the domain expert: the GNR algorithm, which does not assume the properties, and the ENR algorithm, which exploits them.
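To make the black-box view concrete, the following is a minimal sketch in Python of how match, merge, and negative rules could be passed around as plain functions. It is not the GNR or ENR algorithm from the paper. The record representation, the naive fixpoint loop in `resolve`, the single-record form of the negative rule, and all function names and sample data are assumptions made for illustration only.

```python
from itertools import combinations
from typing import Callable, FrozenSet, List, Tuple

# Assumption: a record is a set of (attribute, value) pairs.
Record = FrozenSet[Tuple[str, str]]


def resolve(records: List[Record],
            match: Callable[[Record, Record], bool],
            merge: Callable[[Record, Record], Record]) -> List[Record]:
    """Naive ER: replace any matching pair by its merge until no pair matches."""
    result = list(dict.fromkeys(records))  # drop exact duplicates, keep order
    changed = True
    while changed:
        changed = False
        for a, b in combinations(result, 2):
            if match(a, b):
                merged = merge(a, b)
                # Simplification: the merged record replaces both sources.
                result = [r for r in result if r not in (a, b)]
                if merged not in result:
                    result.append(merged)
                changed = True
                break  # the list changed, so restart the pairwise scan
    return result


def inconsistent(records: List[Record],
                 negative_rule: Callable[[Record], bool]) -> List[Record]:
    """Return the records that the (hypothetical) negative rule flags."""
    return [r for r in records if negative_rule(r)]


# Hypothetical black-box rules for a toy person database.
def match(a: Record, b: Record) -> bool:
    # Two records match if they share a name or a phone value.
    return any(attr in ("name", "phone") for attr, _ in a & b)


def merge(a: Record, b: Record) -> Record:
    # Merging keeps every attribute value from both records.
    return a | b


def two_genders(r: Record) -> bool:
    # Negative rule: one person should not carry two gender values.
    return len({value for attr, value in r if attr == "gender"}) > 1


r1 = frozenset({("name", "J. Doe"), ("phone", "555-1234")})
r2 = frozenset({("name", "J. Doe"), ("gender", "F")})
r3 = frozenset({("phone", "555-1234"), ("gender", "M")})

resolved = resolve([r1, r2, r3], match, merge)
flagged = inconsistent(resolved, two_genders)
```

In this toy run, all three records collapse into a single merged record that carries both gender values, so the negative rule flags it. Deciding how to repair such a result (for example, which merge to undo) is where the abstract's domain-expert guidance would come in; the sketch deliberately stops at detection.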

Keywords

Generic entity resolution, Inconsistency, Negative rule, Data cleaning

Copyright information

© Springer-Verlag 2009

Authors and Affiliations

  • Steven Euijong Whang (1)
  • Omar Benjelloun (2)
  • Hector Garcia-Molina (1)
  1. Computer Science Department, Stanford University, Stanford, USA
  2. Google Inc., Mountain View, USA
