Encyclopedia of Database Systems

Living Edition
| Editors: Ling Liu, M. Tamer Özsu

Record Matching

Living reference work entry
DOI: https://doi.org/10.1007/978-1-4899-7993-3_594-2

Synonyms

Definition

Record matching is the problem of identifying whether two records in a database refer to the same real-world entity. For example, in Fig. 1, the customer record A1 in Table A and record B1 in Table B probably refer to the same customer and should therefore be matched. (The example in Fig. 1 is adapted from an example in [21].) As Fig. 1 suggests, the same entity can be encoded in different ways in a database; this is fairly common and arises naturally from differing formatting conventions, abbreviations, and typographic errors. Record matching is often studied in the following setting: given two relations A and B, identify all pairs of matching records, one from each relation. For the two tables in Fig. 1, a reasonable output might be the pairs (A1, B1) and (A2, B2). In some settings of the record...
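
The two-relation setting described above can be illustrated with a minimal Python sketch. It is not the method of any cited work: it simply compares every record in A with every record in B using a generic string similarity from the standard library and keeps the pairs that score above a threshold. The sample records, the field names, the function names, and the threshold of 0.8 are all assumptions made for illustration; they are not the contents of Fig. 1.

```python
from difflib import SequenceMatcher
from itertools import product


def normalize(record):
    """Join all field values into one lowercase string with single spaces."""
    joined = " ".join(str(value) for value in record.values())
    return " ".join(joined.lower().split())


def similarity(rec_a, rec_b):
    """Return a similarity score in [0, 1] for two records."""
    return SequenceMatcher(None, normalize(rec_a), normalize(rec_b)).ratio()


def match_records(table_a, table_b, threshold=0.8):
    """Return all (id_a, id_b) pairs whose similarity is at least threshold.

    Every record of A is compared with every record of B, so the cost is
    quadratic in the table sizes; this is only practical for small inputs.
    """
    return [
        (id_a, id_b)
        for (id_a, rec_a), (id_b, rec_b) in product(table_a.items(), table_b.items())
        if similarity(rec_a, rec_b) >= threshold
    ]


# Hypothetical sample data (NOT the records shown in Fig. 1).
table_a = {
    "A1": {"name": "John Smith", "city": "Seattle"},
    "A2": {"name": "Mary Jones", "city": "Portland"},
}
table_b = {
    "B1": {"name": "Jon Smith", "city": "Seattle"},
    "B2": {"name": "Mary Jones", "city": "Portland"},
}

matches = match_records(table_a, table_b)
print(matches)  # [('A1', 'B1'), ('A2', 'B2')]
```

The exhaustive comparison keeps the sketch short. Real systems typically replace the generic similarity with a measure suited to the data and avoid comparing all pairs.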

Recommended Reading

  1. Arasu A, Chaudhuri S, Kaushik R. Transformation-based framework for record matching. In: Proceedings of the 24th international conference on data engineering. 2008. p. 40–9.
  2. Arasu A, Ganti V, Kaushik R. Efficient exact set-similarity joins. In: Proceedings of the 32nd international conference on very large data bases. 2006. p. 918–29.
  3. Bilenko M, Mooney RJ. Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining. 2003. p. 39–48.
  4. Chaudhuri S, Chen BC, Ganti V, Kaushik R. Example-driven design of efficient record matching queries. In: Proceedings of the 33rd international conference on very large data bases. 2007. p. 327–38.
  5. Chaudhuri S, Ganjam K, Ganti V, Motwani R. Robust and efficient fuzzy match for online data cleaning. In: Proceedings of the ACM SIGMOD international conference on management of data. 2003. p. 313–24.
  6. Chaudhuri S, Ganti V, Kaushik R. A primitive operator for similarity joins in data cleaning. In: Proceedings of the 22nd international conference on data engineering. 2006.
  7. Cochinwala M, Kurien V, Lalk G, Shasha D. Efficient data reconciliation. Inf Sci. 2001;137(1–4):1–15.
  8. Cohen WW. Data integration using similarity joins and a word-based information representation language. ACM Trans Inf Syst. 2000;18(3):288–321.
  9. Elmagarmid AK, Ipeirotis PG, Verykios VS. Duplicate record detection: a survey. IEEE Trans Knowl Data Eng. 2007;19(1):1–16.
  10. Fellegi IP, Sunter AB. A theory for record linkage. J Am Stat Assoc. 1969;64(328):1183–210.
  11. Hernandez M, Stolfo S. The merge/purge problem for large databases. In: Proceedings of the ACM SIGMOD international conference on management of data. 1995. p. 127–38.
  12. Jaro MA. Unimatch: a record linkage system: user’s manual. Technical Report. Washington, DC: US Bureau of the Census; 1976.
  13. Jaro MA. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J Am Stat Assoc. 1989;84(406):414–20.
  14. Koudas N, Sarawagi S, Srivastava D. Record linkage: similarity measures and algorithms. In: Proceedings of the ACM SIGMOD international conference on management of data. 2006. p. 802–3.
  15. McCallum A, Nigam K, Ungar LH. Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of the 6th ACM SIGKDD international conference on knowledge discovery and data mining. 2000. p. 169–78.
  16. Newcombe HB, Kennedy JM, Axford SJ, James AP. Automatic linkage of vital records. Science. 1959;130:954–9.
  17. Sarawagi S, Bhamidipaty A. Interactive deduplication using active learning. In: Proceedings of the 8th ACM SIGKDD international conference on knowledge discovery and data mining. 2002. p. 269–78.
  18. Sarawagi S, Kirpal A. Efficient set joins on similarity predicates. In: Proceedings of the ACM SIGMOD international conference on management of data. 2004. p. 743–54.
  19. Torra V, Domingo-Ferrer J. Record linkage methods for multidatabase data mining. In: Torra V, editor. Information fusion in data mining. Springer; 2003. p. 101–32.
  20. Winkler W. Improved decision rules in the Fellegi-Sunter model of record linkage. Technical Report. Washington, DC: Statistical Research Division/US Bureau of the Census; 1993.
  21. Winkler W. The state of record linkage and current research problems. Technical Report. Washington, DC: Statistical Research Division/US Bureau of the Census; 1999.

Copyright information

© Springer Science+Business Media LLC 2016

Authors and Affiliations

  1. Microsoft Research, Redmond, USA
  2. Universitat Rovira i Virgili, Tarragona, Catalonia, Spain