Generic entity resolution with negative rules

  • Regular Paper
The VLDB Journal

Abstract

Entity resolution (ER) (also known as deduplication or merge-purge) is a process of identifying records that refer to the same real-world entity and merging them together. In practice, ER results may contain “inconsistencies,” either due to mistakes by the match and merge function writers or changes in the application semantics. To remove the inconsistencies, we introduce “negative rules” that disallow inconsistencies in the ER solution (ER-N). A consistent solution is then derived based on the guidance from a domain expert. The inconsistencies can be resolved in several ways, leading to accurate solutions. We formalize ER-N, treating the match, merge, and negative rules as black boxes, which permits expressive and extensible ER-N solutions. We identify important properties for the rules that, if satisfied, enable less costly ER-N. We develop and evaluate two algorithms that find an ER-N solution based on guidance from the domain expert: the GNR algorithm that does not assume the properties and the ENR algorithm that exploits the properties.
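
To make the black-box view concrete, here is a minimal Python sketch, not the GNR or ENR algorithm. It assumes records are sets of (attribute, value) pairs, and the rules `match`, `merge` and `violates` are hypothetical examples invented for illustration. A naive pairwise loop merges matching records, and the negative rule then flags inconsistent results. Choosing how to repair them is left to the domain expert, as in the abstract.

```python
from itertools import combinations
from typing import FrozenSet, Set, Tuple

# Assumption: a record is a set of (attribute, value) pairs.
Record = FrozenSet[Tuple[str, str]]

def match(r: Record, s: Record) -> bool:
    """Hypothetical match rule: records match if they share a name or a phone."""
    return any(attr in {"name", "phone"} for attr, _ in r & s)

def merge(r: Record, s: Record) -> Record:
    """Hypothetical merge rule: union of the two records' pairs."""
    return r | s

def violates(r: Record) -> bool:
    """Hypothetical negative rule: a record may not carry two distinct genders."""
    return len({v for attr, v in r if attr == "gender"}) > 1

def resolve(records: Set[Record]) -> Set[Record]:
    """Naive fixpoint: replace any matching pair by its merge until none match."""
    result = set(records)
    changed = True
    while changed:
        changed = False
        for r, s in combinations(result, 2):
            if match(r, s):
                result -= {r, s}
                result.add(merge(r, s))
                changed = True
                break  # restart the scan on the updated set
    return result

def inconsistent(records: Set[Record]) -> Set[Record]:
    """Records in the ER result that the negative rule disallows."""
    return {r for r in records if violates(r)}

r1 = frozenset({("name", "Pat"), ("gender", "F")})
r2 = frozenset({("name", "Pat"), ("phone", "555"), ("gender", "M")})
r3 = frozenset({("name", "Lee"), ("phone", "777")})

resolved = resolve({r1, r2, r3})
flagged = inconsistent(resolved)
```

Here `r1` and `r2` share a name, so they are merged. The merged record carries two genders, so the negative rule flags it, while `r3` is left untouched. Because the three rules are ordinary functions, any of them can be swapped without changing `resolve` or `inconsistent`.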

Author information

Corresponding author

Correspondence to Steven Euijong Whang.

About this article

Cite this article

Whang, S.E., Benjelloun, O. & Garcia-Molina, H. Generic entity resolution with negative rules. The VLDB Journal 18, 1261–1277 (2009). https://doi.org/10.1007/s00778-009-0136-3

  • DOI: https://doi.org/10.1007/s00778-009-0136-3