A Framework for Statistical Entity Identification in R
Entity identification deals with matching records, from different datasets or within one dataset, that represent the same real-world entity when unique identifiers are not available. By enabling data integration at the record level as well as the detection of duplicates, entity identification plays a major role in data preprocessing, especially with regard to data quality. This paper presents a framework for statistical entity identification, focusing in particular on probabilistic record linkage and string matching, and its implementation in R. Following the stages of the entity identification process, the framework is structured into seven core components: data preparation, candidate selection, comparison, scoring, classification, decision, and evaluation. Samples of real-world CRM datasets serve as illustrative examples.
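To make the staged structure concrete, the following is a minimal sketch of the first five components (data preparation, candidate selection, comparison, scoring, classification). It is written in Python purely for illustration and is not the paper's R implementation. The toy records, the field names, the m/u probabilities, the similarity threshold, and the two classification thresholds are all assumptions chosen for the example; the scoring uses Fellegi-Sunter style log-likelihood weights, a common choice in probabilistic record linkage.

```python
import math
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records; not taken from the paper's CRM samples.
RECORDS = [
    {"id": 1, "name": "Anna  Meier", "zip": "1010"},
    {"id": 2, "name": "anna meyer", "zip": "1010"},
    {"id": 3, "name": "Karl Huber", "zip": "1010"},
    {"id": 4, "name": "Anna Meier", "zip": "8010"},
]

# Assumed per-field probabilities: (P(agree | match), P(agree | non-match)).
M_U = {"name": (0.95, 0.05), "zip": (0.90, 0.20)}


def prepare(rec):
    """Data preparation: lowercase the name and collapse whitespace."""
    out = dict(rec)
    out["name"] = " ".join(rec["name"].lower().split())
    return out


def candidate_pairs(records, key="zip"):
    """Candidate selection: blocking, so only records sharing `key` are paired."""
    blocks = {}
    for rec in records:
        blocks.setdefault(rec[key], []).append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)


def compare(a, b, threshold=0.8):
    """Comparison: binary agreement pattern, with approximate string matching on name."""
    name_sim = SequenceMatcher(None, a["name"], b["name"]).ratio()
    return {"name": name_sim >= threshold, "zip": a["zip"] == b["zip"]}


def score(pattern):
    """Scoring: sum of log2 agreement/disagreement weights over all fields."""
    total = 0.0
    for field, agrees in pattern.items():
        m, u = M_U[field]
        total += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return total


def classify(weight, lower=0.0, upper=4.0):
    """Classification: two assumed thresholds yield three classes."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


def link(records):
    """Run the five stages and return a class for each candidate pair."""
    cleaned = [prepare(r) for r in records]
    return {
        (a["id"], b["id"]): classify(score(compare(a, b)))
        for a, b in candidate_pairs(cleaned)
    }


if __name__ == "__main__":
    for pair, label in link(RECORDS).items():
        print(pair, label)
```

In this sketch, records 1 and 4 are never compared because they fall into different blocks. This illustrates the trade-off in candidate selection: fewer comparisons at the risk of missing true matches. Pairs classified as "possible link" would go on to the decision stage (for example, clerical review), and an evaluation stage would compare the resulting classes against known match status.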