A machine learning approach to create blocking criteria for record linkage

Giang, Phan H.

doi:10.1007/s10729-014-9276-0

Published: 29 April 2014

Volume 18, pages 93–105, (2015)

Health Care Management Science

Phan H. Giang¹


Abstract

Record linkage, a part of data cleaning, is recognized as one of the most expensive steps in data warehousing. Most record linkage (RL) systems use blocking filters to reduce the number of pairs to be matched. A blocking filter consists of a number of blocking criteria. Until recently, blocking criteria were selected manually by domain experts. This paper proposes a new method to automatically learn efficient blocking criteria for record linkage. Our method addresses the lack of sufficient labeled data for training. Unlike previous work, we do not consider a blocking filter in isolation but in the context of an accompanying matcher that is applied after the blocking filter. We show that, given such a matcher, the labels (assigned to record pairs) that are relevant for learning are the labels assigned by the matcher (link/nonlink), not the labels assigned objectively (match/unmatch). This conclusion allows us to generate an unlimited amount of labeled data for training. We formulate the problem of learning a blocking filter as a Disjunctive Normal Form (DNF) learning problem and use Probably Approximately Correct (PAC) learning theory to guide the development of an algorithm that searches for blocking filters. We test the algorithm on a real patient master file of 2.18 million records. The experimental results show that, compared with filters obtained by educated guess, the optimal learned filters have comparable recall but reduce runtime by an order of magnitude.
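To make the DNF view of a blocking filter concrete, the following is a minimal sketch in Python. It is not the paper's learning algorithm; it only illustrates the representation described in the abstract: a filter as a disjunction of blocking criteria, each criterion a conjunction of atomic tests, with recall measured against the labels of an accompanying matcher. The helper names (same_field, criterion, blocking_filter, recall_against_matcher) and the exact-equality test on a field are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, str]
Predicate = Callable[[Record, Record], bool]


def same_field(field: str) -> Predicate:
    """Hypothetical atomic test: both records share a non-empty value in `field`."""
    def test(a: Record, b: Record) -> bool:
        return bool(a.get(field)) and a.get(field) == b.get(field)
    return test


def criterion(*tests: Predicate) -> Predicate:
    """A blocking criterion: a conjunction of atomic tests."""
    return lambda a, b: all(t(a, b) for t in tests)


def blocking_filter(criteria: List[Predicate]) -> Predicate:
    """A blocking filter in DNF: a pair passes if any criterion accepts it."""
    return lambda a, b: any(c(a, b) for c in criteria)


def recall_against_matcher(
    pairs: Iterable[Tuple[Record, Record]],
    filt: Predicate,
    matcher: Predicate,
) -> float:
    """Fraction of pairs linked by the matcher that the filter lets through."""
    linked = [(a, b) for a, b in pairs if matcher(a, b)]
    if not linked:
        return 1.0  # nothing to lose when the matcher links no pair
    return sum(filt(a, b) for a, b in linked) / len(linked)
```

Because the reference labels come from the matcher rather than from manual review, any sample of record pairs can serve as training or evaluation data in this sketch.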

Notes

For N = 3 × 10⁶ records, it would take 2.85 years to match all 9 × 10¹² pairs, assuming a matching rate of 10⁵ pairs per second.
At N = 3 × 10⁶ and a 10% duplication rate (m = 3 × 10⁵), the probability that a randomly picked pair is matched is only 3.3 × 10⁻⁸.
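The two figures above can be checked with a few lines of arithmetic. The sketch below assumes, as the notes appear to, that the number of candidate pairs is counted as N² and that a year has 365.25 days; the function names are illustrative only.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed year length


def years_to_match_all(n_records: int, pairs_per_second: float) -> float:
    """Time to compare every pair, counting N * N candidate pairs."""
    total_pairs = n_records * n_records
    return total_pairs / pairs_per_second / SECONDS_PER_YEAR


def matched_pair_probability(n_records: int, n_matched_pairs: int) -> float:
    """Chance that a randomly picked pair is a matched pair."""
    return n_matched_pairs / (n_records * n_records)


N = 3_000_000
m = N // 10  # 10% duplication rate

print(f"{years_to_match_all(N, 1e5):.2f} years")
print(f"{matched_pair_probability(N, m):.2e}")
```

With these inputs the first value rounds to 2.85 years and the second to 3.3 × 10⁻⁸, which is why exhaustive matching is impractical and why random sampling yields almost no matched pairs.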
Available at http://hadoop.apache.org
Informally, a matcher is a matching algorithm that outputs the dichotomous values link/nonlink, while a scorer is a matching algorithm that outputs numeric scores that are used to decide whether to link a pair of records. In this paper the difference is not important, so matcher is used instead of scorer.
A special string comparison method for which alignment is irrelevant is the edit distance, which measures the difference between two strings by the minimal number of operations needed to transform one string into the other [15]. But edit distance, determined by dynamic programming, is expensive to compute.
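For reference, a standard dynamic-programming computation of the edit distance can be sketched as follows. This is the textbook formulation with unit costs for insertion, deletion and substitution, not code from the paper; the function name edit_distance is an assumption.

```python
def edit_distance(s: str, t: str) -> int:
    """Minimal number of insertions, deletions and substitutions turning s into t."""
    if len(s) < len(t):
        s, t = t, s  # keep the shorter string in the inner loop to save memory
    # prev[j] holds the distance between the processed prefix of s and t[:j]
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty string
        for j, ct in enumerate(t, start=1):
            substitution = prev[j - 1] + (0 if cs == ct else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]
```

Each call fills a table with len(s) × len(t) cells, so applying it to every candidate pair of records multiplies an already quadratic number of comparisons by a per-pair quadratic cost.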
Assume that c is the average size of record clusters, that is, groups of records that must be linked together. In other words, each record is linked on average with c others. The total number of matched pairs is cN and, with about N² candidate pairs, the probability that a randomly picked pair is matched is c/N. Except in extreme cases, c is often less than 1.
Winkler [24] cites some estimates that up to 90% of the work in data warehousing from multiple sources is spent on removing duplicates.

References

1. Aizenstein H, Pitt L (1995) On the learnability of disjunctive normal form formulas. Mach Learn 19: 183–208
2. Baeza-Yates R, Ribeiro-Neto B (1999) Modern Information Retrieval. Addison Wesley
3. Baxter R, Christen P, Churches T (2003) A comparison of fast blocking methods for record linkage. In: Proceedings of ACM SIGKDD’03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation, pp. 25–27
4. Blum A, Burch C, Langford J (1998) On learning monotone boolean functions. In: IEEE Symposium on Foundations of Computer Science, pp. 408–415
5. Bouhaddou O, Bennett J, Cromwell T, Nixon G, Teal J, Davis M, Smith R, Fischetti L, Parker D, Gillen Z, Mattison J (2011) The Department of Veterans Affairs, Department of Defense, and Kaiser Permanente Nationwide Health Information Network exchange in San Diego: patient selection, consent, and identity matching. AMIA Ann Symp Procs 2011: 135–43
6. Cao Y, Chen Z, Zhu J, Yue P, Lin C-Y, Yu Y (2011) Leveraging unlabeled data to scale blocking for record linkage. In: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence IJCAI
7. Fellegi I P, Sunter A B (1969) A theory for record linkage. J Am Stat Assoc 64: 1183–1210
8. Fleming M, Kirby B, Penny K I (2012) Record linkage in Scotland and its applications to health research. J Clin Nurs 21(19–20): 2711–21. doi:10.1111/j.1365-2702.2011.04021.x
9. Gu L, Baxter R A (2004) Adaptive filtering for efficient record linkage. In: Berry M W, Dayal U, Kamath C, Skillicorn D (eds) Proceedings of the section of survey research methods, American Statistical Association. SIAM
10. Hernandez M A, Stolfo S J (1998) Real-world data is dirty: data cleansing and merge/purge problem. J Data Min Knowl Discov 1: 2
11. Herzog T N, Scheuren F J, Winkler W E (2007) Data Quality and Record Linkage Techniques. Springer
12. Jackson J, Lee H, Servedio R, Wan A (2011) Learning random monotone DNF. Discret Appl Math 159(5): 259–271
13. Joffe E, Byrne M J, Reeder P, Herskovic J R, Johnson C W, McCoy A B, Sittig D F, Bernstam E V (2013) A benchmark comparison of deterministic and probabilistic methods for defining manual review datasets in duplicate records reconciliation. J Am Med Inform Assoc. doi:10.1136/amiajnl-2013-001744
14. Kearns M, Li M, Pitt L, Valiant L G (1987) On the learnability of Boolean formulae. In: Aho A V (ed) Proceedings of 19th Annual ACM Symposium on Theory of Computing, (New York, 1987). ACM Press, New York, pp 285–295
15. Levenshtein V I (1966) Binary codes capable of correcting deletions, insertions and reversals. Soviet Phys Doklady 10(8): 707–710. Translated from Russian: Doklady Academia Nauk SSSR 163(4) pp. 845–848, 1965
16. McCallum A, Nigam K, Ungar L (2000) Learning to match and cluster large high-dimensional data sets for data integration. In: Proceedings of the Sixth International Conference on KDD, pp. 169–170
17. Michelson M, Knoblock C A (2006) Learning blocking schemes for record linkage. In: Proceedings of 21st National Conference on Artificial Intelligence (AAAI-2006), vol 1, pp 440–445
18. Newcombe H B, Kennedy J M (1962) Record linkage: making maximum use of the discriminating power of identifying information. Commun Assoc Comput Mach 5: 563–567
19. Servedio R A (2001) On learning monotone DNF under product distributions. In: 14th Annual Conference on Computational Learning Theory, COLT 2001 and 5th European Conference on Computational Learning Theory, EuroCOLT 2001, Amsterdam, The Netherlands, July 2001, vol 2111. Springer, Berlin, pp 558–573
20. Silveira D P, Artmann E (2009) Accuracy of probabilistic record linkage applied to health databases: systematic review. Rev Saude Publica 43(5): 875–82
21. Valiant L (2013) Probably Approximately Correct: Nature’s Algorithms for Learning and Prospering in a Complex World. Basic Books
22. Valiant L G (1984) A theory of the learnable. Commun ACM 27(11): 1134–1142
23. Wan A (2010) Learning, Cryptography, and the Average Case. PhD thesis. Columbia University, New York
24. Winkler W E (2006) Overview of record linkage and current research directions. Tech. Rep. 2006-2, Statistical Research Division, U.S. Census Bureau, Washington
25. Wu C, Walsh A S, Rosenfeld R (2011) Genotype phenotype mapping in RNA viruses - disjunctive normal form learning. Pac Symp Biocomput 16: 62–73

Author information

Authors and Affiliations

George Mason University, 4400 University Dr., Fairfax, VA 22030, USA
Phan H. Giang

Corresponding author

Correspondence to Phan H. Giang.

About this article

Cite this article

Giang, P.H. A machine learning approach to create blocking criteria for record linkage. Health Care Manag Sci 18, 93–105 (2015). https://doi.org/10.1007/s10729-014-9276-0

Received: 29 July 2013
Accepted: 21 February 2014
Published: 29 April 2014
Issue Date: March 2015
DOI: https://doi.org/10.1007/s10729-014-9276-0
