Record Linkage

Encyclopedia of Machine Learning and Data Mining

Abstract

Many data mining and machine learning projects require information from multiple data sources to be integrated and linked before it can be used for further analysis. A crucial task in such data integration is to identify which records refer to the same real-world entities across databases when no common entity identifiers are available and when records can contain errors and variations. This process of record linkage therefore has to rely on the attributes that are available in the databases to be linked. For databases that contain personal information, for example, about customers, taxpayers, or patients, these are commonly names, addresses, phone numbers, and dates of birth.

To improve the scalability of the linkage process, blocking or indexing techniques are commonly applied to limit the comparison of records to pairs or groups that likely correspond to the same entity. Records are compared using a variety of comparison functions, most commonly approximate string comparators that account for typographical errors and variations in textual attributes. The compared records are then classified into matches, non-matches, and potential matches, depending on the decision model used. If training data in the form of true matches and non-matches are available, supervised classification techniques can be employed. However, in many practical record linkage applications, no ground truth data are available, and unsupervised approaches are therefore required; an approach known as probabilistic record linkage is commonly employed in this setting. In this article we provide an overview of record linkage with an emphasis on the classification aspects of this process.
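
The three steps named above (blocking, approximate comparison, and classification into three classes) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy records, the field names, the use of the postcode as blocking key, the standard-library `difflib.SequenceMatcher` as a stand-in for a dedicated approximate string comparator, and the two fixed thresholds. In particular, the thresholds are hand-picked and are not derived from a probabilistic decision model.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import product

# Hypothetical toy databases; field names and values are illustrative only.
db_a = [
    {"id": "a1", "name": "peter miller", "postcode": "2600"},
    {"id": "a2", "name": "gail smith", "postcode": "2612"},
]
db_b = [
    {"id": "b1", "name": "pete miller", "postcode": "2600"},
    {"id": "b2", "name": "gayle smyth", "postcode": "2612"},
    {"id": "b3", "name": "mary jones", "postcode": "2600"},
]


def block(records, key):
    """Group records by a blocking key so only records sharing it are compared."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key]].append(rec)
    return blocks


def similarity(s, t):
    """Approximate string similarity in [0, 1] (stand-in comparator)."""
    return SequenceMatcher(None, s, t).ratio()


def classify(score, lower=0.6, upper=0.85):
    """Three-way decision using two hand-picked (assumed) thresholds."""
    if score >= upper:
        return "match"
    if score >= lower:
        return "potential match"
    return "non-match"


def link(records_a, records_b, key="postcode"):
    """Compare only record pairs that fall into the same block."""
    blocks_a, blocks_b = block(records_a, key), block(records_b, key)
    results = {}
    for value in blocks_a.keys() & blocks_b.keys():
        for rec_a, rec_b in product(blocks_a[value], blocks_b[value]):
            score = similarity(rec_a["name"], rec_b["name"])
            results[(rec_a["id"], rec_b["id"])] = classify(score)
    return results


results = link(db_a, db_b)
for pair, label in sorted(results.items()):
    print(pair, label)
```

With these toy records, blocking reduces the six possible cross-database pairs to three candidate pairs. The price is that a true match whose blocking key value contains an error would never be compared, which is why the choice of blocking key matters in practice.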

Author information

Corresponding author

Correspondence to Peter Christen.

Copyright information

© 2016 Springer Science+Business Media New York

About this entry

Cite this entry

Christen, P., Winkler, W.E. (2016). Record Linkage. In: Sammut, C., Webb, G. (eds) Encyclopedia of Machine Learning and Data Mining. Springer, Boston, MA. https://doi.org/10.1007/978-1-4899-7502-7_712-1

  • DOI: https://doi.org/10.1007/978-1-4899-7502-7_712-1

  • Publisher Name: Springer, Boston, MA

  • Online ISBN: 978-1-4899-7502-7

  • eBook Packages: Springer Reference Computer Sciences; Reference Module Computer Science and Engineering

Chapter history

  1. Latest

    Record Linkage
    Published:
    25 March 2023

    DOI: https://doi.org/10.1007/978-1-4899-7502-7_712-2

  2. Original

    Record Linkage
    Published:
    17 June 2016

    DOI: https://doi.org/10.1007/978-1-4899-7502-7_712-1