A machine learning approach to create blocking criteria for record linkage

Health Care Management Science

Abstract

Record linkage, a part of data cleaning, is recognized as one of the most expensive steps in data warehousing. Most record linkage (RL) systems use blocking filters to reduce the number of pairs to be matched. A blocking filter consists of a number of blocking criteria. Until recently, blocking criteria have been selected manually by domain experts. This paper proposes a new method to automatically learn efficient blocking criteria for record linkage. Our method addresses the lack of sufficient labeled data for training. Unlike previous work, we do not consider a blocking filter in isolation but in the context of an accompanying matcher that is applied after the blocking filter. We show that, given such a matcher, the labels (assigned to record pairs) that are relevant for learning are the labels assigned by the matcher (link/nonlink), not the labels assigned objectively (match/unmatch). This conclusion allows us to generate an unlimited amount of labeled data for training. We formulate the problem of learning a blocking filter as a Disjunctive Normal Form (DNF) learning problem and use Probably Approximately Correct (PAC) learning theory to guide the development of an algorithm to search for blocking filters. We test the algorithm on a real patient master file of 2.18 million records. The experimental results show that, compared with filters obtained by educated guess, the optimal learned filters have comparable recall but reduce runtime by an order of magnitude.
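To make the DNF formulation concrete, the following is a minimal sketch of how a blocking filter could be represented as a disjunction of conjunctions of blocking criteria. It is an illustration only, not the paper's algorithm: the field names (`last`, `first`, `dob`, `zip`), the exact-agreement criterion `same`, and the sample records are all hypothetical, and the quadratic loop over all pairs is used only to show the filter's semantics (a practical system would use indexing to avoid enumerating every pair).

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Criterion = Callable[[Record, Record], bool]
# A DNF filter: a list of terms, each term a list of criteria (a conjunction).
BlockingFilter = List[List[Criterion]]


def same(field: str) -> Criterion:
    """Hypothetical blocking criterion: non-empty exact agreement on one field."""
    def criterion(a: Record, b: Record) -> bool:
        return a[field] != "" and a[field] == b[field]
    return criterion


def passes(blocking_filter: BlockingFilter, a: Record, b: Record) -> bool:
    """A pair passes if every criterion of at least one term holds."""
    return any(all(c(a, b) for c in term) for term in blocking_filter)


def candidate_pairs(records: List[Record],
                    blocking_filter: BlockingFilter) -> List[Tuple[int, int]]:
    """Index pairs that survive the filter and would be sent to the matcher."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if passes(blocking_filter, a, b)
    ]


# Made-up records, for illustration only.
records = [
    {"last": "smith", "first": "john", "dob": "1970-01-02", "zip": "22030"},
    {"last": "smith", "first": "jon", "dob": "1970-01-02", "zip": "22031"},
    {"last": "smyth", "first": "john", "dob": "1970-01-02", "zip": "22030"},
    {"last": "jones", "first": "mary", "dob": "1985-06-30", "zip": "22030"},
]

# (same last AND same dob) OR (same first AND same zip)
blocking_filter = [
    [same("last"), same("dob")],
    [same("first"), same("zip")],
]

print(candidate_pairs(records, blocking_filter))  # [(0, 1), (0, 2)]
```

In this toy example only two of the six possible pairs reach the matcher. Adding a term to the disjunction can only admit more pairs, while adding a criterion to a term can only admit fewer, which is the trade-off between recall and runtime that a learned filter has to balance.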

Notes

  1. For \(N = 3 \times 10^{6}\) records, it would take 2.85 years to match all \(9 \times 10^{12}\) pairs, assuming a matching rate of \(10^{5}\) pairs per second.

  2. At \(N = 3 \times 10^{6}\) and a 10% duplication rate (\(m = 3 \times 10^{5}\)), the probability that a pair is matched is only \(3.3 \times 10^{-8}\).

  3. Available at http://hadoop.apache.org

  4. Informally, a matcher is a matching algorithm that outputs dichotomous values (link/nonlink), while a scorer is a matching algorithm that outputs numeric scores that are used to decide whether to link a pair of records. In this paper the difference is not important, so matcher is used instead of scorer.

  5. One string comparison method for which alignment is irrelevant is the edit distance, which measures the difference between two strings by the minimal number of operations needed to transform one string into the other [15]. However, edit distance, which is computed by dynamic programming, is expensive to compute.

  6. Assume that c is the average size of record clusters, that is, groups of records that must be linked together. In other words, each record is on average linked with c others. The total number of matched pairs is \(cN\) and the probability that a randomly picked pair is matched is \(\frac {c}{N}\). Except in extreme cases, c is often less than 1.

  7. Winkler [24] cites some estimates that up to 90% of work in data warehousing from multiple sources is spent on removing duplicates.
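The figures in Notes 1, 2 and 6 can be checked with a few lines of arithmetic. The sketch below is only a sanity check under stated assumptions: it takes the total number of pairs as \(N^{2}\), as the notes do, reads the 10% duplication rate of Note 2 as \(c = 0.1\) in the notation of Note 6, and uses a 365.25-day year. The function names are illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed year length


def years_to_match_all(n_records: int, pairs_per_second: float) -> float:
    """Time to compare all N^2 pairs at a fixed matching rate (Note 1)."""
    total_pairs = n_records ** 2
    return total_pairs / pairs_per_second / SECONDS_PER_YEAR


def match_probability(c: float, n_records: int) -> float:
    """Probability that a random pair is matched: cN / N^2 = c / N (Note 6)."""
    matched_pairs = c * n_records
    return matched_pairs / n_records ** 2


n = 3 * 10**6
print(round(years_to_match_all(n, 10**5), 2))  # 2.85
print(f"{match_probability(0.1, n):.1e}")      # 3.3e-08
```

Both values agree with the notes: roughly 2.85 years for exhaustive matching, and a match probability of about \(3.3 \times 10^{-8}\).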

References

  1. Aizenstein H, Pitt L (1995) On the learnability of disjunctive normal form formulas. Mach Learn 19: 183–208

  2. Baeza-Yates R, Ribeiro-Neto B (1999) Modern Information Retrieval. Addison Wesley

  3. Baxter R, Christen P, Churches T (2003) A comparison of fast blocking methods for record linkage. In: Proceedings of ACM SIGKDD’03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation, pp. 25–27

  4. Blum A, Burch C, Langford J (1998) On learning monotone boolean functions. In: IEEE Symposium on Foundations of Computer Science, pp. 408–415

  5. Bouhaddou O, Bennett J, Cromwell T, Nixon G, Teal J, Davis M, Smith R, Fischetti L, Parker D, Gillen Z, Mattison J (2011) The Department of Veterans Affairs, Department of Defense, and Kaiser Permanente Nationwide Health Information Network exchange in San Diego: patient selection, consent, and identity matching. AMIA Ann Symp Procs 2011: 135–43

  6. Cao Y, Chen Z, Zhu J, Yue P, Lin C-Y, Yu Y (2011) Leveraging unlabeled data to scale blocking for record linkage. In: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence IJCAI

  7. Fellegi I P, Sunter A B (1969) A theory for record linkage. J Am Stat Assoc 64: 1183–1210

  8. Fleming M, Kirby B, Penny KI (2012) Record linkage in Scotland and its applications to health research. J Clin Nurs 21(19–20): 2711–21. doi:10.1111/j.1365-2702.2011.04021.x

  9. Gu L, Baxter R A (2004) Adaptive filtering for efficient record linkage. In: Berry MW, Dayal U, Kamath C, Skillicorn D (eds) Proceedings of the section of survey research methods, American Statistical Association. SIAM

  10. Hernandez M A, Stolfo S J (1998) Real-world data is dirty: data cleansing and merge/purge problem. J Data Min Knowl Discov 1: 2

  11. Herzog T N, Scheuren F J, Winkler W E (2007) Data Quality and Record Linkage Techniques. Springer

  12. Jackson J, Lee H, Servedio R, Wan A (2011) Learning random monotone DNF. Discret Appl Math 159(5): 259–271

  13. Joffe E, Byrne M J, Reeder P, Herskovic J R, Johnson C W, McCoy A B, Sittig D F, Bernstam E V (2013) A benchmark comparison of deterministic and probabilistic methods for defining manual review datasets in duplicate records reconciliation. J Am Med Inform Assoc. doi:10.1136/amiajnl-2013-001744

  14. Kearns M, Li M, Pitt L, Valiant L G (1987) On the learnability of Boolean formulae. In: Aho A V (ed) Proceedings of 19th Annual ACM Symposium on Theory of Computing, (New York, 1987). ACM Press, New York, pp 285–295

  15. Levenshtein V I (1966) Binary codes capable of correcting deletions, insertions and reversals. Soviet Phys Doklady 10(8): 707–710. Translated from Russian: Doklady Academia Nauk SSSR 163(4) pp. 845–848, 1965

  16. McCallum A, Nigam K, Ungar L (2000) Learning to match and cluster large high-dimensional data sets for data integration. In: Proceedings of the Sixth International Conference on KDD, pp. 169–170

  17. Michelson M, Knoblock C A (2006) Learning blocking schemes for record linkage. In: Proceedings of 21st National Conference on Artificial Intelligence (AAAI-2006), vol 1, pp 440–445

  18. Newcombe H B, Kennedy J M (1962) Record linkage: making maximum use of the discriminating power of identifying information. Commun Assoc Comput Mach 5: 563–567

  19. Servedio R A (2001) On learning monotone DNF under product distributions. In: 14th Annual Conference on Computational Learning Theory, COLT 2001 and 5th European Conference on Computational Learning Theory, EuroCOLT 2001, Amsterdam, The Netherlands, July 2001, vol 2111. Springer, Berlin, pp 558–573

  20. Silveira D P, Artmann E (2009) Accuracy of probabilistic record linkage applied to health databases: systematic review. Rev Saude Publica 43(5): 875–82

  21. Valiant L (2013) Probably Approximately Correct: Nature’s Algorithms for Learning and Prospering in a Complex World. Basic Books

  22. Valiant L G (1984) A theory of the learnable. Commun ACM 27(11): 1134–1142

  23. Wan A (2010) Learning, Cryptography, and the Average Case. PhD thesis. Columbia University, New York

  24. Winkler W E (2006) Overview of record linkage and current research directions. Tech. Rep. 2006-2, Statistical Research Division, U.S. Census Bureau, Washington

  25. Wu C, Walsh A S, Rosenfeld R (2011) Genotype phenotype mapping in RNA viruses - disjunctive normal form learning. Pac Symp Biocomput 16: 62–73

Author information

Corresponding author

Correspondence to Phan H. Giang.

About this article

Cite this article

Giang, P.H. A machine learning approach to create blocking criteria for record linkage. Health Care Manag Sci 18, 93–105 (2015). https://doi.org/10.1007/s10729-014-9276-0

  • DOI: https://doi.org/10.1007/s10729-014-9276-0