Record Linkage Methodology and Applications

Handbook of Data Intensive Computing

Abstract

As information technology advances and the Internet grows, more business is conducted electronically and globally. Individuals and organizations have more channels through which to publish and gather information. As a result, they face a growing challenge: processing large volumes of data and finding relevant, high-quality information that fits their specific business needs. In addition, data gathered from multiple sources usually contains errors and duplicate information, so duplicates need to be detected and removed in the data preparation phase, before advanced data mining is performed [1, 2, 3, 4]. In other cases, data from a single source is not enough to provide a complete view of a person or entity. The data must then be linked or integrated to provide a single, complete view of a person, a product, an object, a geographical area, or any other entity, as a specific business application requires [5, 6].

References

  1. W. E. Winkler, Record Linkage Software and Methods for Merging Administrative Lists, Statistical Research Report Series, Statistical Research Division, U.S. Bureau of the Census.

  2. P. Christen, A two-step classification approach to unsupervised record linkage. In AusDM’07, CRPIT vol. 70, pages 111–119, Gold Coast, Australia, 2007.

  3. P. Christen and K. Goiser (2005), Assessing deduplication and data linkage quality: What to measure, in ‘Proceedings of the fourth Australasian Data Mining Conference (AusDM 2005)’, Sydney.

  4. P. Christen and K. Goiser, Quality and Complexity Measures for Data Linkage and Deduplication, Accepted for Quality Measures in Data Mining, Springer, 2006.

  5. L. Zhang, M. Wasson, and V. Templar, Methods and Systems for Matching Records and Normalizing Names, WO/2010/088052

  6. C. Dozier and R. Haschart, Automatic Extraction and Linking of Person Names in Legal Text, RIAO-2000 Proceedings

  7. H. L. Dunn, (1946) Record Linkage, American Journal of Public Health, 36, 1412–1415

  8. H. B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James, Automatic Linkage of Vital Records, Science 130 (1959), 954–959

  9. H. B. Newcombe and J. M. Kennedy, Record Linkage, Making Maximum Use of the Discriminating Power of Identifying Information, Communications of the ACM, 1962, Vol. 5, Issue 11

  10. H. B. Newcombe, Record linking: The design of efficient systems for linking records into individual and family histories, Am J Hum Genet. 1967 May; 19(3 Pt 1): 335–359.

  11. I. Fellegi and A. Sunter, A theory for record linkage, Journal of the American Statistical Association, 64(328):1183–1210, 1969.

  12. W. Winkler. The State of Record Linkage and Current Research Problems. U.S. Bureau of the Census, Research Report, 1999.

  13. K. Goiser and P. Christen, Towards Automated Record Linkage, In Proc. Fifth Australasian Data Mining Conference (AusDM2006)

  14. J. S. Lawson, Record Linkage Techniques for Improving Online Genealogical Research using Census Index Records, ASA Section on Survey Research Methods.

  15. W. W. Cohen and J. Richman, Learning to Match and Cluster Large High-Dimensional Data Sets For Data Integration, SIGKDD’02

  16. P. Christen, Automatic Record Linkage using Seeded Nearest Neighbour and Support Vector Machine Classification, In ACM KDD’08, Pages 151–159, Las Vegas, 2008

  17. M. G. Elfeky, T. M. Ghanem, V. S. Verykios, A. R. Huwait, and A. K. Elmagarmid, Record Linkage: A Machine Learning Approach, A Toolbox, and A Digital Government Web Service, Technical Report CSD-TR 03–024

  18. D. A. Bayliss, Database systems and methods for linking records and entity representations with sufficiently high confidence, US 2009/0271424 A1

  19. D. A. Bayliss, Database Systems and Methods for Linking Records, WO/2010/003061

  20. M. Fair, Generalized Record Linkage System – Statistics Canada’s Record Linkage Software, Austrian Journal of Statistics, Volume 33 (2004), Number 1&2, 37–53

  21. P. Christen. Febrl-An Open Source Data Cleaning, Deduplication and Record Linkage System with a Graphical User Interface. August 2008.

  22. M. Elfeky, V. Verykios, and A. Elmagarmid. TAILOR: A record linkage toolbox. In ICDE’02, pages 17–28, San Jose, 2002.

  23. C. W. Kelman, J. Bass, and D. Holman, (2002), ‘Research use of linked health data – A best practice protocol’, Aust NZ Journal of Public Health, vol. 26, pp. 251–255.

  24. W. E. Winkler. Methods for evaluating and creating data quality, Elsevier Information Systems, 29(7):531–550, 2004.

  25. M. Cochinwala, S. Dalal, A. K. Elmagarmid, V. S. Verykios, Record Matching, Past, Present and Future, Submitted to ACM Computing Surveys, 2003

  26. D. Loshin, Ed Allburn, Customer Data Integration, Linkage Precision and Match Accuracy, Information Management Magazine, November 2004

  27. E. Ted, H. Goldberg, J. Wooton, M. Cottini, and A. Khan, (1995). Financial Crimes Enforcement Network AI System (FAIS) Identifying Potential Money Laundering from Reports of Large Cash Transactions. AI Magazine, 16(4), 21–39. Retrieved from http://www.aaai.org/ojs/index.php/aimagazine/article/viewArticle/1169

  28. H. Issa, Application of Duplicate Records detection Techniques to Duplicate Payments in a Real Business Environment, Rutgers Business School, Rutgers University 2010

  29. Inspector General. (1997), Audit report duplicate payments. Retrieved from http://www.va.gov/oig/52/reports/1997/7AF-G01-035--duppay.pdf.

  30. A. C. Novello, (2004). Duplicate Medicaid Transportation Payments, 1–4. Retrieved from http://www.osc.state.ny.us/audits/allaudits/093004/04f2.pdf

  31. L. Karl Branting, BAE Systems, Inc, Name Matching in Law Enforcement and CounterTerrorism, Columbia, MD 21046, USA, karl.branting@baesystems.com

  32. J. Jonas and J. Harper, Effective counterterrorism and the limited role of predictive data mining, Policy Analysis, (584), 2006.

  33. http://www.stanford.edu/dept/EHS/prod/researchlab/chem/inven/new_chem_inven_instr.html

  34. M.-Y. Kan and Y. F. Tan, Record Matching in Digital Library Metadata, Technical Opinion, Communications of The ACM, Vol. 51, No. 2, 02/2008

  35. O. Charif, H. Omrani, O. Klein, M. Schneider, and P. Trigano, A method and a tool for geocoding and record linkage, Working Paper No 2010-17, 07/2010

  36. C. Giraud-Carrier, J. Goodliffe, and B. Jones, Improving the Study of Campaign Contributors with Record Linkage

  37. S. J. Grannis, J. M. Overhage, and C. J. McDonald, Analysis of Identifier Performance using a Deterministic Linkage Algorithm

  38. S. Gomatam, R. Carter, M. Ariet, and G. Mitchell, An empirical comparison of record linkage procedures. Statistics in Medicine, vol. 21, no. 10, pp. 1485–1496, May 2002.

  39. F. Maggi, A Survey of Probabilistic Record Matching Models, Techniques and Tools, Scientific Report TR-2008-22

  40. A. Elmagarmid, P. Ipeirotis, and V. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1):1–16, 2007.

  41. W. E. Winkler, Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage. Technical Report RR2000/05, US Bureau of the Census, 2000

  42. P. Christen, T. Churches, and J. X. Zhu, Probabilistic Name and Address Cleaning and Standardization, the Australasian Data Mining Workshop 2002

  43. J. Friedman, T. Hastie, and R. Tibshirani, Additive logistic regression: a statistical view of boosting, The Annals of Statistics, 28(2):337–407, 2000.

  44. M. Elfeky, V. Verykios, and A. Elmagarmid. TAILOR: A record linkage toolbox. In ICDE’02, pages 17–28, San Jose, 2002.

  45. S. Sarawagi and A. Bhamidipaty, Interactive deduplication using active learning. Proceedings of the 8th ACM SIGKDD conference, Edmonton, July 2002.

  46. J. Rennie, Boosting with decision stumps and binary features, 2003

  47. R. E. Schapire, The Boosting Approach to Machine Learning: An Overview, in Nonlinear Estimation and Classification, Springer, 2003

  48. M. Jaro. Software Demonstrations. In Proc. of an International Workshop and Exposition – Record Linkage Techniques, Arlington, VA, USA, 1997.

  49. E. Rundensteiner (Ed.), Special Issue on Data Transformation, IEEE Data Engineering Bulletin, March 1999.

  50. L. Gu, R. Baxter, D. Vickers, and C. Rainsford. Record linkage: Current practice and future directions. Technical Report 03/83, CSIRO Mathematical and Information Sciences, Canberra, Australia, April 2003.

  51. D. Knuth, The Art of Computer Programming, Volume III, Addison-Wesley 1973.

  52. M. Hernandez and S. Stolfo. Real World Data is Dirty: Data Cleansing and the Merge/Purge Problem. Journal of Data Mining and Knowledge Discovery, 2(1), pages 9–37, 1998.

  53. A. McCallum, K. Nigam, and L. H. Ungar, “Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching,” Proc. Sixth ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (KDD’00), pp. 169–178, 2000.

  54. R. K. Chapman, D. A. Bayliss, and G. C. Halliday, Methods and Systems for Dynamically Creating Keys in a Database System, US 7739287 B1

  55. M. A. Hernandez and S. J. Stolfo. The Merge/Purge Problem for Large Databases. In Proc. of 1995 ACM SIGMOD Conf., pages 127–138, 1995.

  56. V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707–710, 1966.

  57. http://cpansearch.perl.org/src/SCW/Text-JaroWinkler-0.1/strcmp95.c

  58. M. A. Jaro, 1989. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association 84:414–420.

  59. M. A. Jaro, 1995. Probabilistic linkage of large public health data files (disc: P687–689). Statistics in Medicine 14:491–498

  60. W. E. Winkler, 1999. The state of record linkage and current research problems. Statistics of Income Division, Internal Revenue Service Publication R99/04. Available from http://www.census.gov/srd/www/byname.html.

  61. L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database. In Proc. 27th Int. Conf. on Very Large Data Bases, pages 491–500, 2001.

  62. W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for name-matching tasks. In Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web (IIWeb-03), 2003. To appear

  63. Journal of the American Statistical Association, 84(406), pages 414–420, 1989.

  64. S. Tejada, C. Knoblock, and S. Minton, Learning domain-independent string transformation weights for high accuracy object identification, In ACM KDD’02, pages 350–359, Edmonton, 2002.

  65. U. Y. Nahm, M. Bilenko, and R. J. Mooney. Two approaches to handling noisy variation in text mining. In TextML’02, pages 18–27, Sydney, 2002.

  66. L. Gu and R. Baxter. Decision models for record linkage. In Selected Papers from AusDM, Springer LNCS 3755, pages 146–160, 2006.

  67. W. E. Winkler, (1995), Advanced methods for record linkage, American Statistical Association, Proceedings of the Section on Survey Research Methods, pp. 467–472.

  68. R. Baxter, P. Christen, and T. Churches. A comparison of fast blocking methods for record linkage. In ACM SIGKDD workshop on Data Cleaning, Record Linkage and Object Consolidation, pages 25–27, Washington DC, 2003.

  69. W. E. Winkler, (1995), Matching and Record Linkage, in B. G. Cox et al. (eds.) Business Survey Methods, New York: J. Wiley, 355–384.

Author information

Corresponding author

Correspondence to Ling Qin Zhang.

Copyright information

© 2011 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Zhang, L.Q. (2011). Record Linkage Methodology and Applications. In: Furht, B., Escalante, A. (eds) Handbook of Data Intensive Computing. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1415-5_14

  • DOI: https://doi.org/10.1007/978-1-4614-1415-5_14

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4614-1414-8

  • Online ISBN: 978-1-4614-1415-5

  • eBook Packages: Computer Science, Computer Science (R0)
