Private Record Linkage: Comparison of Selected Techniques for Name Matching
The rise of Big Data Analytics has shown the utility of analyzing all aspects of a problem by bringing together disparate data sets. Efficient and accurate private record linkage algorithms are necessary to achieve this. However, records are often linked based on personally identifiable information, and protecting the privacy of individuals is critical. This paper contributes to this field by studying an important component of the private record linkage problem: linking based on names while keeping those names encrypted, both on disk and in memory. We explore the applicability, accuracy and speed of three different primary approaches to this problem (along with several variations) and compare the results to common name-matching metrics on unprotected data. While these approaches are not new, this paper provides a thorough analysis on a range of datasets containing systematically introduced flaws common to name-based data entry, such as typographical errors, optical character recognition errors, and phonetic errors.
Keywords: Record Linkage, Optical Character Recognition, Encrypt Data, Dice Coefficient, Data Consumer
This work was partially supported by the LexisNexis corporation.