Encyclopedia of Database Systems

2018 Edition
Editors: Ling Liu, M. Tamer Özsu

Record Matching

  • Arvind Arasu
  • Josep Domingo-Ferrer
Reference work entry
DOI: https://doi.org/10.1007/978-1-4614-8265-9_594

Synonyms

Deduplication in Data Cleaning; Duplicate detection; Entity resolution; Instance identification; Merge-purge; Name matching; Record linkage

Definition

Record matching is the problem of identifying whether two records in a database refer to the same real-world entity. For example, in Fig. 1, the customer record A1 in Table A and record B1 in Table B probably refer to the same customer, and should therefore be matched. (The example in Fig. 1 was adapted from an example in [21].) As Fig. 1 suggests, the same entity can be encoded in different ways in a database; this is fairly common and arises for a variety of natural reasons, such as different formatting conventions, abbreviations, and typographic errors. Record matching is often studied in the following setting: given two relations A and B, identify all pairs of matching records, one from each relation. For the two tables in Fig. 1, a reasonable output might be the pairs (A1, B1) and (A2, B2). In some settings of...
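
The two-relation setting described above can be illustrated with a minimal sketch. It is not the method of any cited work: it assumes records are dictionaries of string fields, uses token-set Jaccard similarity as a stand-in matching criterion, and compares every pair of records. The function names, the threshold of 0.5, and the sample records are all hypothetical; the records are not the contents of Fig. 1.

```python
from itertools import product


def tokens(record):
    """Lower-case a record's fields and split them into a set of word tokens."""
    text = " ".join(str(v) for v in record.values())
    return set(text.lower().replace(",", " ").replace(".", " ").split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_records(rel_a, rel_b, threshold=0.5):
    """Return (id_a, id_b) pairs whose token sets are at least `threshold` similar.

    Every pair is compared, so the cost is |A| * |B|; illustration only.
    """
    matches = []
    for (id_a, rec_a), (id_b, rec_b) in product(rel_a.items(), rel_b.items()):
        if jaccard(tokens(rec_a), tokens(rec_b)) >= threshold:
            matches.append((id_a, id_b))
    return matches


# Hypothetical records (assumed for illustration, not taken from Fig. 1).
table_a = {
    "A1": {"name": "John Smith", "city": "Seattle"},
    "A2": {"name": "Mary Jones", "city": "Portland"},
}
table_b = {
    "B1": {"name": "Smith, John", "city": "Seattle"},  # different name formatting
    "B2": {"name": "Mary Jones", "city": "Portland"},
    "B3": {"name": "Alan Turing", "city": "London"},
}

print(match_records(table_a, table_b))
```

Token sets ignore word order, so this sketch tolerates formatting differences such as "Smith, John" versus "John Smith", but it does not handle abbreviations or typographic errors well; those would call for a different similarity measure.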

Recommended Reading

  1. Arasu A, Chaudhuri S, Kaushik R. Transformation-based framework for record matching. In: Proceedings of the 24th International Conference on Data Engineering; 2008. p. 40–9.
  2. Arasu A, Ganti V, Kaushik R. Efficient exact set-similarity joins. In: Proceedings of the 32nd International Conference on Very Large Data Bases; 2006. p. 918–29.
  3. Bilenko M, Mooney RJ. Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2004. p. 39–48.
  4. Chaudhuri S, Chen BC, Ganti V, Kaushik R. Example-driven design of efficient record matching queries. In: Proceedings of the 33rd International Conference on Very Large Data Bases; 2007. p. 327–38.
  5. Chaudhuri S, Ganjam K, Ganti V, Motwani R. Robust and efficient fuzzy match for online data cleaning. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2003. p. 313–24.
  6. Chaudhuri S, Ganti V, Kaushik R. A primitive operator for similarity joins in data cleaning. In: Proceedings of the 22nd International Conference on Data Engineering; 2006.
  7. Cochinwala M, Kurien V, Lalk G, Shasha D. Efficient data reconciliation. Inf Sci. 2001;137(1–4):1–15.
  8. Cohen WW. Data integration using similarity joins and a word-based information representation language. ACM Trans Inf Syst. 2000;18(3):288–321.
  9. Elmagarmid AK, Ipeirotis PG, Verykios VS. Duplicate record detection: a survey. IEEE Trans Knowl Data Eng. 2007;19(1):1–16.
  10. Fellegi IP, Sunter AB. A theory for record linkage. J Am Stat Assoc. 1969;64(328):1183–210.
  11. Hernandez M, Stolfo S. The merge/purge problem for large databases. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 1995. p. 127–38.
  12. Jaro MA. Unimatch: a record linkage system: user’s manual. Technical Report. Washington, DC: US Bureau of the Census; 1976.
  13. Jaro MA. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J Am Stat Assoc. 1989;84(406):414–20.
  14. Koudas N, Sarawagi S, Srivastava D. Record linkage: similarity measures and algorithms. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2006. p. 802–3.
  15. McCallum A, Nigam K, Ungar LH. Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2000. p. 169–78.
  16. Newcombe HB, Kennedy JM, Axford SJ, James AP. Automatic linkage of vital records. Science. 1959;130(3381):954–9.
  17. Sarawagi S, Bhamidipaty A. Interactive deduplication using active learning. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2002. p. 269–78.
  18. Sarawagi S, Kirpal A. Efficient set joins on similarity predicates. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2004. p. 743–54.
  19. Torra V, Domingo-Ferrer J. Record linkage methods for multidatabase data mining. In: Torra V, editor. Information fusion in data mining. Springer; 2003. p. 101–32.
  20. Winkler W. Improved decision rules in the Fellegi-Sunter model of record linkage. Technical Report. Washington, DC: Statistical Research Division/US Bureau of the Census; 1993.
  21. Winkler W. The state of record linkage and current research problems. Technical Report. Washington, DC: Statistical Research Division/US Bureau of the Census; 1999.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Microsoft Research, Redmond, USA
  2. Universitat Rovira i Virgili, Tarragona, Catalonia, Spain

Section editors and affiliations

  • Venkatesh Ganti (1)
  1. Microsoft Research, Microsoft Corporation, Redmond, USA