Abstract
As information technology advances rapidly and the Internet grows, more and more business is conducted electronically and globally. Individuals and organizations have more channels for publishing and gathering information. As a result, they face an increasing challenge in processing large volumes of data and finding relevant, high-quality information that fits their specific business needs. In addition, data gathered from multiple sources usually contains errors and duplicate records. There is a strong need to detect and remove duplicates in the data preparation phase, before advanced data mining is performed [1, 2, 3, 4]. In other cases, data gathered from a single source is not enough to provide a complete view of a person or entity. The data therefore needs to be linked or integrated to provide a single, complete view of a person, a product, an object, a geographical area, or any other entity, to meet the needs of a specific business application [5, 6].
References
W. E. Winkler. Record Linkage Software and Methods for Merging Administrative Lists, U.S. Bureau of the Census, Statistical Research Division, Statistical Research Report Series
P. Christen, A two-step classification approach to unsupervised record linkage. In AusDM’07, CRPIT vol. 70, pages 111–119, Gold Coast, Australia, 2007.
P. Christen and K. Goiser (2005), Assessing deduplication and data linkage quality: What to measure, in ‘Proceedings of the fourth Australasian Data Mining Conference (AusDM 2005)’, Sydney.
P. Christen and K. Goiser, Quality and Complexity Measures for Data Linkage and Deduplication, Accepted for Quality Measures in Data Mining, Springer, 2006.
L. Zhang, M. Wasson, and V. Templar, Methods and Systems for Matching Records and Normalizing Names, WO/2010/088052
C. Dozier and R. Haschart, Automatic Extraction and Linking of Person Names in Legal Text, RIAO-2000 Proceedings
H. L. Dunn, (1946) Record Linkage, American Journal of Public Health, 36, 1412–1415
H. B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James, Automatic Linkage of Vital Records, Science, 130:954–959, 1959
H. B. Newcombe and J. M. Kennedy, Record Linkage: Making Maximum Use of the Discriminating Power of Identifying Information, Communications of the ACM, Vol. 5, Issue 11, 1962
H. B. Newcombe, Record linking: The design of efficient systems for linking records into individual and family histories, Am J Hum Genet. 1967 May; 19(3 Pt 1): 335–359.
I. Fellegi and A. Sunter, A theory for record linkage, Journal of the American Statistical Association, 64(328):1183–1210, 1969.
W. Winkler. The State of Record Linkage and Current Research Problems. U.S. Bureau of the Census, Research Report, 1999.
K. Goiser and P. Christen, Towards Automated Record Linkage, In Proc. Fifth Australasian Data Mining Conference (AusDM 2006)
J. S. Lawson, Record Linkage Techniques for Improving Online Genealogical Research using Census Index Records, ASA Section on Survey Research Methods.
W. W. Cohen and J. Richman, Learning to Match and Cluster Large High-Dimensional Data Sets for Data Integration, SIGKDD’02
P. Christen, Automatic Record Linkage using Seeded Nearest Neighbour and Support Vector Machine Classification, In ACM KDD’08, Pages 151–159, Las Vegas, 2008
M. G. Elfeky, T. M. Ghanem, V. S. Verykios, A. R. Huwait, and A. K. Elmagarmid, Record Linkage: A Machine Learning Approach, A Toolbox, and A Digital Government Web Service, Technical Report CSD-TR 03–024
D. A. Bayliss, Database systems and methods for linking records and entity representations with sufficiently high confidence, US 2009/0271424 A1
D. A. Bayliss, DATABASE SYSTEMS AND METHODS FOR LINKING RECORDS, WO/2010/003061
M. Fair, Generalized Record Linkage System – Statistics Canada’s Record Linkage Software, Austrian Journal of Statistics, Volume 33 (2004), Number 1&2, 37–53
P. Christen. Febrl-An Open Source Data Cleaning, Deduplication and Record Linkage System with a Graphical User Interface. August 2008.
M. Elfeky, V. Verykios, and A. Elmagarmid. TAILOR: A record linkage toolbox. In ICDE’02, pages 17–28, San Jose, 2002.
C. W. Kelman, J. Bass, and D. Holman, (2002), ‘Research use of linked health data – A best practice protocol’, Aust NZ Journal of Public Health, vol. 26, pp. 251–255.
W. E. Winkler. Methods for evaluating and creating data quality, Elsevier Information Systems, 29(7):531–550, 2004.
M. Cochinwala, S. Dalal, A. K. Elmagarmid, V. S. Verykios, Record Matching, Past, Present and Future, Submitted to ACM Computing Surveys, 2003
D. Loshin, Ed Allburn, Customer Data Integration, Linkage Precision and Match Accuracy, Information Management Magazine, November 2004
T. E. Senator, H. Goldberg, J. Wooton, M. Cottini, and A. Khan, (1995). Financial Crimes Enforcement Network AI System (FAIS): Identifying Potential Money Laundering from Reports of Large Cash Transactions. AI Magazine, 16(4), 21–39. Retrieved from http://www.aaai.org/ojs/index.php/aimagazine/article/viewArticle/1169
H. Issa, Application of Duplicate Records Detection Techniques to Duplicate Payments in a Real Business Environment, Rutgers Business School, Rutgers University, 2010
Inspector General. (1997), Audit report duplicate payments. Retrieved from http://www.va.gov/oig/52/reports/1997/7AF-G01-035--duppay.pdf.
A. C. Novello, (2004). Duplicate Medicaid Transportation Payments, 1–4. Retrieved from http://www.osc.state.ny.us/audits/allaudits/093004/04f2.pdf
L. K. Branting, Name Matching in Law Enforcement and Counter-Terrorism, BAE Systems, Inc., Columbia, MD 21046, USA, karl.branting@baesystems.com
J. Jonas and J. Harper, Effective counterterrorism and the limited role of predictive data mining, Policy Analysis, (584), 2006.
http://www.stanford.edu/dept/EHS/prod/researchlab/chem/inven/new_chem_inven_instr.html
M.-Y. Kan and Y. F. Tan, Record Matching in Digital Library Metadata, Technical Opinion, Communications of The ACM, Vol. 51, No. 2, 02/2008
O. Charif, H. Omrani, O. Klein, M. Schneider, and P. Trigano, A method and a tool for geocoding and record linkage, Working Paper No. 2010-17, 07/2010
C. Giraud-Carrier, J. Goodliffe, and B. Jones, Improving the Study of Campaign Contributors with Record Linkage
S. J. Grannis, J. M. Overhage, and C. J. McDonald, Analysis of Identifier Performance using a Deterministic Linkage Algorithm
S. Gomatam, R. Carter, M. Ariet, and G. Mitchell, An empirical comparison of record linkage procedures. Statistics in Medicine, vol. 21, no. 10, pp. 1485–1496, May 2002.
F. Maggi, A Survey of Probabilistic Record Matching Models, Techniques and Tools, Scientific Report TR-2008-22
A. Elmagarmid, P. Ipeirotis, and V. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1):1–16, 2007.
W. E. Winkler, Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage. Technical Report RR2000/05, US Bureau of the Census, 2000
P. Christen, T. Churches, and J. X. Zhu, Probabilistic Name and Address Cleaning and Standardization, the Australasian Data Mining Workshop 2002
J. Friedman, T. Hastie, and R. Tibshirani, Additive logistic regression: a statistical view of boosting, The Annals of Statistics, 28(2):337–407, 2000.
M. Elfeky, V. Verykios, and A. Elmagarmid. TAILOR: A record linkage toolbox. In ICDE’02, pages 17–28, San Jose, 2002.
S. Sarawagi and A. Bhamidipaty, Interactive deduplication using active learning. Proceedings of the 8th ACM SIGKDD conference, Edmonton, July 2002.
J. Rennie, Boosting with decision stumps and binary features, 2003
R. E. Schapire, The Boosting Approach to Machine Learning: An Overview, in Nonlinear Estimation and Classification, Springer, 2003
M. Jaro. Software Demonstrations. In Proc. of an International Workshop and Exposition – Record Linkage Techniques, Arlington, VA, USA, 1997.
E. Rundensteiner (Ed.), Special Issue on Data Transformation, IEEE Data Engineering Bulletin, March 1999.
L. Gu, R. Baxter, D. Vickers, and C. Rainsford. Record linkage: Current practice and future directions. Technical Report 03/83, CSIRO Mathematical and Information Sciences, Canberra, Australia, April 2003.
D. Knuth, The Art of Computer Programming, Volume 3, Addison-Wesley, 1973.
M. Hernandez and S. Stolfo. Real World Data is Dirty: Data Cleansing and the Merge/Purge Problem. Journal of Data Mining and Knowledge Discovery, 2(1), pages 9–37, 1998.
A. McCallum, K. Nigam, and L. H. Ungar, “Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching,” Proc. Sixth ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (KDD’00), pp. 169–178, 2000.
R. K. Chapman, D. A. Bayliss, G. C. Halliday, METHODS AND SYSTEMS FOR DYNAMICALLY CREATING KEYS IN A DATABASE SYSTEM, US 7739287 B1
M. A. Hernandez and S. J. Stolfo. The Merge/Purge Problem for Large Databases. In Proc. of 1995 ACM SIGMOD Conf., pages 127–138, 1995.
V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707–710, 1966.
http://cpansearch.perl.org/src/SCW/Text-JaroWinkler-0.1/strcmp95.c
M. A. Jaro, 1989. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association 84:414–420.
M. A. Jaro, 1995. Probabilistic linkage of large public health data files (disc: P687–689). Statistics in Medicine 14:491–498
W. E. Winkler, 1999. The state of record linkage and current research problems. Statistics of Income Division, Internal Revenue Service Publication R99/04. Available from http://www.census.gov/srd/www/byname.html.
L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database. In Proc. 27th Int. Conf. on Very Large Data Bases, pages 491–500, 2001.
W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for name-matching tasks. In Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web (IIWeb-03), 2003. To appear
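M. A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406), pages 414–420, 1989.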
S. Tejada, C. Knoblock, and S. Minton, Learning domain-independent string transformation weights for high accuracy object identification, In ACM KDD’02, pages 350–359, Edmonton, 2002.
U. Y. Nahm, M. Bilenko, and R. J. Mooney. Two approaches to handling noisy variation in text mining. In TextML’02, pages 18–27, Sydney, 2002.
L. Gu and R. Baxter. Decision models for record linkage. In Selected Papers from AusDM, Springer LNCS 3755, pages 146–160, 2006.
W. E. Winkler, (1995), Advanced methods for record linkage, American Statistical Association, Proceedings of the Section on Survey Research Methods, pp. 467–472.
R. Baxter, P. Christen, and T. Churches. A comparison of fast blocking methods for record linkage. In ACM SIGKDD workshop on Data Cleaning, Record Linkage and Object Consolidation, pages 25–27, Washington DC, 2003.
W. E. Winkler, (1995), Matching and Record Linkage, in B. G. Cox et al. (eds.) Business Survey Methods, New York: J. Wiley, 355–384.
Copyright information
© 2011 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Zhang, L.Q. (2011). Record Linkage Methodology and Applications. In: Furht, B., Escalante, A. (eds) Handbook of Data Intensive Computing. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1415-5_14
DOI: https://doi.org/10.1007/978-1-4614-1415-5_14
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-1414-8
Online ISBN: 978-1-4614-1415-5
eBook Packages: Computer Science (R0)