A Framework for Statistical Entity Identification in R

  • Michaela Denk
Part of the Studies in Classification, Data Analysis, and Knowledge Organization book series (STUDIES CLASS)

Abstract

Entity identification deals with matching records, from different datasets or within one dataset, that represent the same real-world entity when unique identifiers are not available. By enabling data integration at the record level as well as the detection of duplicates, entity identification plays a major role in data preprocessing, especially with respect to data quality. This paper presents a framework for statistical entity identification, focusing in particular on probabilistic record linkage and string matching, and its implementation in R. Following the stages of the entity identification process, the framework is structured into seven core components: data preparation, candidate selection, comparison, scoring, classification, decision, and evaluation. Samples of real-world CRM datasets serve as illustrative examples.
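
The seven stages named above can be illustrated with a minimal sketch. It is written in Python rather than R and is not the chapter's implementation. The toy records, the m- and u-probabilities, the similarity cutoff of 0.8, the two thresholds, and all function names are assumptions chosen for illustration. Scoring uses log-likelihood-ratio agreement weights, as is common in probabilistic record linkage, and string matching uses the standard library's `difflib.SequenceMatcher` as a stand-in for a dedicated string comparator.

```python
import math
from difflib import SequenceMatcher

# Hypothetical toy records, not the chapter's CRM samples.
A = [{"id": "a1", "name": "Anna Maier", "zip": "1010"},
     {"id": "a2", "name": "Peter Huber", "zip": "4020"}]
B = [{"id": "b1", "name": "Ana  Mayer", "zip": "1010"},
     {"id": "b2", "name": "PETER HUBER", "zip": "4020"},
     {"id": "b3", "name": "Paul Gruber", "zip": "4020"}]

# Assumed m- and u-probabilities per field:
# P(field agrees | match) and P(field agrees | non-match).
M_U = {"name": (0.90, 0.05), "zip": (0.95, 0.20)}
UPPER, LOWER = 3.0, -2.0  # assumed classification thresholds

def prepare(rec):
    """Data preparation: lower-case and collapse whitespace in the name."""
    return {**rec, "name": " ".join(rec["name"].lower().split())}

def candidate_pairs(left, right):
    """Candidate selection: block on the first letter of the name."""
    return [(a, b) for a in left for b in right if a["name"][0] == b["name"][0]]

def compare(a, b):
    """Comparison: approximate agreement on name, exact agreement on zip."""
    name_sim = SequenceMatcher(None, a["name"], b["name"]).ratio()
    return {"name": name_sim >= 0.8, "zip": a["zip"] == b["zip"]}

def score(agreement):
    """Scoring: sum of log2 likelihood-ratio weights over the fields."""
    total = 0.0
    for field, agrees in agreement.items():
        m, u = M_U[field]
        total += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return total

def classify(weight):
    """Classification: link, possible link (manual review), or non-link."""
    if weight >= UPPER:
        return "link"
    return "non-link" if weight <= LOWER else "possible"

def decide(scored):
    """Decision: greedy one-to-one assignment among pairs classified as links."""
    used_a, used_b, links = set(), set(), []
    for a_id, b_id, weight, label in sorted(scored, key=lambda s: -s[2]):
        if label == "link" and a_id not in used_a and b_id not in used_b:
            links.append((a_id, b_id))
            used_a.add(a_id)
            used_b.add(b_id)
    return links

def evaluate(links, truth):
    """Evaluation: precision and recall against known true matches."""
    hits = len(set(links) & truth)
    return (hits / len(links) if links else 0.0), hits / len(truth)

left = [prepare(r) for r in A]
right = [prepare(r) for r in B]
scored = []
for a, b in candidate_pairs(left, right):
    w = score(compare(a, b))
    scored.append((a["id"], b["id"], w, classify(w)))
links = decide(scored)
# The set of true matches is assumed known here, for evaluation only.
precision, recall = evaluate(links, {("a1", "b1"), ("a2", "b2")})
print(scored, links, precision, recall)
```

Blocking on the first letter keeps the number of compared pairs small, at the cost of missing matches whose first letter differs. A real application would estimate the m- and u-probabilities from data rather than fix them by hand.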

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Michaela Denk (1)
  1. EC3 - E-Commerce Competence Center, Vienna, Austria
