Advertisement

Advanced Record Linkage Methods and Privacy Aspects for Population Reconstruction—A Survey and Case Studies

  • Peter ChristenEmail author
  • Dinusha Vatsalan
  • Zhichun Fu
Chapter

Abstract

Recent times have seen an increased interest into techniques that allow the linking of records across databases. The main challenges of record linkage are (1) scalability to the increasingly large databases common today; (2) accurate and efficient classification of compared records into matches and non-matches in the presence of variations and errors in the data; and (3) privacy issues that occur when the linking of records is based on sensitive personal information about individuals. The first challenge has been addressed by the development of scalable indexing techniques, the second through advanced classification techniques that either employ machine learning- or graph-based methods, and the third challenge is investigated by research into privacy-preserving record linkage (PPRL). In this chapter, we describe these major challenges of record linkage in the context of population reconstruction. We survey recent developments of advanced record linkage methods, discuss two real-world case studies, and provide directions for future research.

Keywords

Linkage Quality Record Linkage Bloom Filter True Match Indexing Technique 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Notes

Acknowledgments

The authors would like to thank Mac Boot (The Australian National University) and Vassilios S. Verykios (Hellenic Open University) for their contributions to the work presented in this chapter.

References

  1. Al-Lawati, A., Lee, D., & McDaniel, P. (2005). Blocking-aware private record linkage. In International Workshop on Information Quality in Information Systems (pp. 59–68). Baltimore.Google Scholar
  2. Antonie, L., Inwood, K., Lizotte, D. J., & Ross, J. A. (2014a). Tracking people over time in 19th century Canada for longitudinal analysis. Machine Learning, 95, 129–146.CrossRefGoogle Scholar
  3. Antonie, L., Inwood, K., & Ross, A. (2014b). Dancing with dirty data: Problems in the extraction of life-course evidence from historical censuses. In Population Reconstruction.Google Scholar
  4. Arasu, A., Götz, M., & Kaushik, R. (2010). On active learning of record matching packages. In ACM SIGMOD (pp. 783–794). Indianapolis.Google Scholar
  5. Atallah, M. J., Kerschbaum, F., & Du, W. (2003). Secure and private sequence comparisons. In ACM Workshop on Privacy in the Electronic Society (pp. 39–44). Washington, DC.Google Scholar
  6. Baffour, B., King, T., & Valente, P. (2013). The modern census: Evolution, examples and evaluation. International Statistical Review, 81(3), 407–425.CrossRefGoogle Scholar
  7. Bellare, K., Iyengar, S., Parameswaran, A. G., & Rastogi, V. (2012). Active sampling for entity matching. In ACM SIGKDD (pp. 1131–1139). Beijing.Google Scholar
  8. Bhattacharya, I., & Getoor, L. (2007). Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data, 1(1), 5.Google Scholar
  9. Bilenko, M., Kamath, B., & Mooney, R. J. (2006). Adaptive blocking: Learning to scale up record linkage. In IEEE ICDM (pp. 87–96). Hong Kong.Google Scholar
  10. Block, W. C., & Star, D. L. (1995). Data entry and verification. Historical Methods: A Journal of Quantitative and Interdisciplinary History, 28(1), 63–65.CrossRefGoogle Scholar
  11. Bloothooft, G. (1995). Multi-source family reconstruction. History and computing, 7(2), 90–103.CrossRefGoogle Scholar
  12. Bonomi, L., Xiong, L., Chen, R., & Fung, B. (2012). Frequent grams based embedding for privacy preserving record linkage. In CIKM (pp. 1597–1601). Maui, Hawaii.Google Scholar
  13. Chiang, Y. H., Doan, A., & Naughton, J. F. (2014). Tracking entities in the dynamic world: A fast algorithm for matching temporal records. PVLDB, 7(6).Google Scholar
  14. Christen, P. (2006). A comparison of personal name matching: Techniques and practical issues. In Workshop on Mining Complex Data, held at IEEE ICDM. Hong Kong.Google Scholar
  15. Christen, P. (2012a). Data Matching—Concepts and techniques for record linkage, entity resolution, and duplicate detection. Data-centric systems and applications. Berlin: Springer.Google Scholar
  16. Christen, P. (2012b). A survey of indexing techniques for scalable record linkage and deduplication. IEEE Transactions on Knowledge and Data Engineering, 24(9), 1537–1555.CrossRefGoogle Scholar
  17. Christen, P. (2014). Advanced record linkage methods and privacy aspects for population reconstruction. In Population Reconstruction.Google Scholar
  18. Christen, P., & Gayler, R.W. (2013). Adaptive temporal entity resolution on dynamic databases. In PAKDD (Vol. 7819, pp. 558–569). Gold Coast, Australia: Springer.Google Scholar
  19. Christen, P., Gayler, R. W., & Hawking, D. (2009). Similarity-aware indexing for real-time entity resolution. In ACM CIKM (pp. 1565–1568). Hong Kong.Google Scholar
  20. Christen, P., & Vatsalan, D. (2013). Flexible and extensible generation and corruption of personal data. In ACM CIKM (pp. 1165–1168). San Francisco.Google Scholar
  21. Christen, P., Vatsalan, D., & Verykios, V. S. (2014). Challenges for privacy preservation in data integration. ACM Journal Data and Information Quality, 5(1–2), 4.Google Scholar
  22. Churches, T. (2003). A proposed architecture and method of operation for improving the protection of privacy and confidentiality in disease registers. BMC Med Res Methodol, 3(1), 1.Google Scholar
  23. Churches, T., Christen, P., Lim, K., & Zhu, J. X. (2002). Preparation of name and address data for record linkage using hidden Markov models. BMC Med Inform Decis Mak, 2, 9.Google Scholar
  24. Dey, D., Mookerjee, V. S., & Liu, D. (2010). Efficient techniques for online record linkage. IEEE Transactions on Knowledge and Data Engineering, 23(3), 373–387.CrossRefGoogle Scholar
  25. de Vries, T., Ke, H., Chawla, S., & Christen, P. (2011). Robust record linkage blocking using suffix arrays and Bloom filters. ACM Transactions on Knowledge Discovery from Data, 5(2), 9.Google Scholar
  26. Dong, X. L., Halevy, A., & Madhavan, J. (2005). Reference reconciliation in complex information spaces. In ACM SIGMOD (pp. 85–96). Baltimore.Google Scholar
  27. Draisbach, U., Naumann, F., Szott, S., & Wonneberg, O. (2012). Adaptive windows for duplicate detection. In IEEE ICDE (pp. 1073–1083). Washington, DC.Google Scholar
  28. Durham, E.A. (2012). A framework for accurate, efficient private record linkage. Ph.D. thesis, Faculty of the Graduate School of Vanderbilt University, Nashville, TN.Google Scholar
  29. Durham, E. A., Xue, Y., Kantarcioglu, M., & Malin, B. (2012). Quantifying the correctness, computational complexity, and security of privacy-preserving string comparators for record linkage. Information Fusion, 13(4), 245–259.CrossRefGoogle Scholar
  30. Dwork, C. (2006). Differential privacy. Automata, languages and programming (pp. 1–12).Google Scholar
  31. Efremova, J., Ranjbar-Sahraei, B., Oliehoek, F. A., Calders, T., & Tuyls, K. (2015). A baseline method for genealogical entity resolution. In: G. Bloothooft, P. Christen, K. Mandemakers, M. Schraagen (Eds.), Population reconstruction. Berlin: Springer.Google Scholar
  32. Elmagarmid, A. K., Ipeirotis, P. G., & Verykios, V. S. (2007). Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1), 1–16.CrossRefGoogle Scholar
  33. Fellegi, I. P., & Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64(328), 1183–1210.CrossRefGoogle Scholar
  34. Fu, Z., Boot, M., Christen, P., & Zhou, J. (2014a). Automatic record linkage of individuals and households in historical census data. International Journal of Humanities and Arts Computing, 8(2), 204–225.Google Scholar
  35. Fu, Z., Christen, P., & Zhou, J. (2014b). A graph matching method for historical census household linkage. In PAKDD (Vol. 8443, pp. 485–496). Tainan, Taiwan: Springer.Google Scholar
  36. Fu, Z., Christen, P., & Boot, M. (2011a). Automatic cleaning and linking of historical census data using household information. In Workshop on Domain Driven Data Mining, held at IEEE ICDM. Vancouver.Google Scholar
  37. Fu, Z., Christen, P., & Boot, M. (2011b). A supervised learning and group linking method for historical census household linkage. In AusDM, CRPIT (Vol. 121). Ballarat, Australia.Google Scholar
  38. Fu, Z., Zhou, J., Christen, P., & Boot, M. (2012) Multiple instance learning for group record linkage. In PAKDD (Vol. 7301, pp. 171–182). Kuala Lumpur, Malaysia: Springer.Google Scholar
  39. Fure, E. (2000). Interactive record linkage: The cumulative construction of life courses. Demographic Research, 3(11), 3–11.Google Scholar
  40. Glasson, E., De Klerk, N., Bass, J., Rosman, D., Palmer, L. J., & Holman, D. (2008). Cohort profile: The Western Australian family connections genealogical project. International Journal of Epidemiology, 37(1), 30–35.CrossRefGoogle Scholar
  41. Hernandez, M. A., & Stolfo, S. J. (1995). The merge/purge problem for large databases. In ACM SIGMOD (pp. 127–138). San Jose.Google Scholar
  42. Herzog, T. N., Scheuren, F. J., & Winkler, W. E. (2007). Data quality and record linkage techniques. Berlin: Springer.Google Scholar
  43. Inan, A., Kantarcioglu, M., Bertino, E., & Scannapieco, M. (2008). A hybrid approach to private record linkage. In IEEE ICDE (pp. 496–505). Cancun, Mexico.Google Scholar
  44. Inan, A., Kantarcioglu, M., Ghinita, G., & Bertino, E. (2010). Private record matching using differential privacy. In EDBT (pp. 123–134). Lausanne, Switzerland.Google Scholar
  45. Ioannou, E., Nejdl, W., Niederée, C., & Velegrakis, Y. (2010). On-the-fly entity-aware query processing in the presence of linkage. VLDB Endowment, 3(1), 429–438.Google Scholar
  46. Jin, L., Li, C., & Mehrotra, S. (2003). Efficient record linkage in large data sets. In DASFAA (pp. 137–146). Tokyo.Google Scholar
  47. Jonas, J., & Harper, J. (2006). Effective counterterrorism and the limited role of predictive data mining. Policy Analysis (584) (2006).Google Scholar
  48. Kalashnikov, D. V., & Mehrotra, S. (2006). Domain-independent data cleaning via analysis of entity-relationship graph. ACM Transactions on Database Systems, 31(2), 716–767.CrossRefGoogle Scholar
  49. Karakasidis, A., & Verykios, V. S. (2009). Privacy preserving record linkage using phonetic codes. In Fourth Balkan Conference in Informatics, IEEE (pp. 101–106). Thessaloniki, Greece.Google Scholar
  50. Karakasidis, A., & Verykios, V. S. (2010). Advances in privacy preserving record linkage. In E-activity and Innovative Technology, Advances in Applied Intelligence Technologies Book Series (pp. 22–34). IGI Global.Google Scholar
  51. Karakasidis, A., & Verykios, V. S. (2012). Reference table based k-anonymous private blocking. In ACM Symposium on Applied Computing (pp. 859–864). Trento, Italy.Google Scholar
  52. Karakasidis, A., Verykios, V. S., & Christen, P. (2011). Fake injection strategies for private phonetic matching. In International Workshop on Data Privacy Management. Leuven, Belgium.Google Scholar
  53. Karapiperis, D., & Verykios, V. S. (2014). An LSH-based blocking approach with a homomorphic matching technique for privacy-preserving record linkage. IEEE Transactions on Knowledge and Data Engineering.Google Scholar
  54. Kejriwal, M., & Miranker, D. P. (2013). An unsupervised algorithm for learning blocking schemes. In IEEE ICDM (pp. 340–349).Google Scholar
  55. Kelman, C. W., Bass, J., & Holman, D. (2002). Research use of linked health data—A best practice protocol. Aust NZ Journal of Public Health, 26, 251–255.CrossRefGoogle Scholar
  56. Köpcke, H., & Rahm, E. (2010). Frameworks for entity matching: A comparison. Data and Knowledge Engineering, 69(2), 197–210.CrossRefGoogle Scholar
  57. Kum, H. C., Krishnamurthy, A., Machanavajjhala, A., & Ahalt, S. (2013). Population informatics: Tapping the social genome to advance society: A vision for putting ‘Big Data’ to work for population informatics. Computer, PP(99).Google Scholar
  58. Kuzu, M., Kantarcioglu, M., Inan, A., Bertino, E., Durham, E., & Malin, B. (2013). Efficient privacy-aware record integration. In EDBT (pp. 167–178). Genoa, Italy.Google Scholar
  59. Lee, D., Kang, J., Mitra, P., Giles, C. L., & On, B. W. (2007). Are your citations clean? Commununications of the ACM, 50, 33–38.CrossRefGoogle Scholar
  60. Li, F., Chen, Y., Luo, B., Lee, D., & Liu, P. (2011). Privacy preserving group linkage. In SSDBM (Vol. 6809, pp. 432–450). Portland: Springer LNCS.Google Scholar
  61. Li, P., Dong, X. L., Maurino, A., & Srivastava, D. (2011). Linking temporal records. VLDB Endowment, 4(11), 956–967.Google Scholar
  62. Lindell, Y., & Pinkas, B. (2009). Secure multiparty computation for privacy-preserving data mining. Journal of Privacy and Confidentiality, 1(1), 5.Google Scholar
  63. Michelson, M., & Knoblock, C. A. (2006). Learning blocking schemes for record linkage. In AAAI. Boston.Google Scholar
  64. Naumann, F., & Herschel, M. (2010). An introduction to duplicate detection. Synthesis Lectures on Data Management (vol. 3). Morgan and Claypool Publishers.Google Scholar
  65. Newcombe, H. B. (1988). Handbook of record linkage: Methods for health and statistical studies, administration, and business. New York: Oxford University Press Inc.Google Scholar
  66. Newcombe, H. B., & Kennedy, J. M. (1962). Record linkage: making maximum use of the discriminating power of identifying information. Communications of the ACM, 5(11), 563–566.CrossRefGoogle Scholar
  67. Newton, G. (2013). Family reconstitution in an urban context: Some observations and methods. Technical Report, University of Cambridge, CWPESH No. 12.Google Scholar
  68. Office for National Statistics. (2013). Beyond 2011 matching anonymous data. Methods and Policies Report M9.Google Scholar
  69. On, B. W., Koudas, N., Lee, D., & Srivastava, D. (2007). Group linkage. In IEEE ICDE (pp. 496–505). Istanbul.Google Scholar
  70. Pang, C., Gu, L., Hansen, D., & Maeder, A. (2009). Privacy-preserving fuzzy matching using a public reference table. Intelligent Patient Management, 189, 71–89.Google Scholar
  71. Quass, D., & Starkey, P. (2003). Record linkage for genealogical databases. In ACM SIGKDD Workshop on Data Cleaning, Record Linkage and Object Consolidation (pp. 40–42). Washington DC.Google Scholar
  72. Ramadan, B., Christen, P., & Liang, H. (2014). Dynamic sorted neighborhood indexing for real-time entity resolution. In ADC (Vol. 8506, pp. 1–12). Brisbane: Springer LNCS.Google Scholar
  73. Ranbaduge, T., Christen, P., & Vatsalan, D. (2014). Tree based scalable indexing for multi-party privacy-preserving record linkage. In AusDM, CRPIT (Vol. 158). Brisbane, Australia.Google Scholar
  74. Rastogi, V., Dalvi, N., & Garofalakis, M. (2011). Large-scale collective entity matching. VLDB Endowment, 4, 208–218.Google Scholar
  75. Ravikumar, P., Cohen, W., & Fienberg, S. (2004). A secure protocol for computing string distance metrics. In Workshop on Privacy and Security Aspects of Data Mining held at IEEE ICDM (pp. 40–46). Brighton, UK.Google Scholar
  76. Reid, A., Davies, R., & Garrett, E. (2002). Nineteenth-century scottish demography from linked censuses and civil registers: A’sets of related individuals’ approach. History and Computing, 14(1–2), 61–86.CrossRefGoogle Scholar
  77. Rudin, C., & Wagstaff, K. L. (2013). Machine learning for science and society. Machine Learning, 95(1), 1–9.Google Scholar
  78. Ruggles, S. (2002). Linking historical censuses: A new approach. History and Computing, 14(1–2), 213–224.CrossRefGoogle Scholar
  79. Scannapieco, M., Figotin, I., Bertino, E., & Elmagarmid, A. K. (2007). Privacy preserving schema and data matching. In ACM SIGMOD (pp. 653–664). Beijing.Google Scholar
  80. Schneier, B. (1996). Applied cryptography: Protocols, algorithms, and source code in C (2nd ed.). New York: Wiley.Google Scholar
  81. Schnell, R., Bachteler, T., & Reiher, J. (2009). Privacy-preserving record linkage using Bloom filters. BioMed Central Medical Informatics and Decision Making, 9(1), 41.Google Scholar
  82. Sehili, Z., Kolb, L., Borgs, C., Schnell, R., & Rahm, E. (2015). Privacy preserving record linkage with PPJoin. In BTW Conference. Hamburg.Google Scholar
  83. Siegel, E. (2013). Predictive analytics: The power to predict who will click, buy, lie, or die. New York: Wiley.Google Scholar
  84. Su, W., Wang, J., & Lochovsky, F. H. (2009). Record matching over query results from multiple web databases. IEEE Transactions on Knowledge and Data Engineering, 22(4), 578–589.CrossRefGoogle Scholar
  85. Sweeney, L. (2002). K-anonymity: A model for protecting privacy. International Journal of Uncertainty Fuzziness and Knowledge Based Systems, 10(5), 557–570.CrossRefGoogle Scholar
  86. Talburt, J.R. (2011). Entity resolution and information quality. Morgan Kaufmann.Google Scholar
  87. Toxen, B. (2014). The NSA and Snowden: Securing the all-seeing eye. Communications of the ACM, 57(5), 44–51.CrossRefGoogle Scholar
  88. Trepetin, S. (2008). Privacy-preserving string comparisons in record linkage systems: a review. Information Security Journal: A Global Perspective, 17(5), 253–266.Google Scholar
  89. Vatsalan, D., & Christen, P. (2012). An iterative two-party protocol for scalable privacy-preserving record linkage. In AusDM, CRPIT (Vol. 134). Sydney, Australia.Google Scholar
  90. Vatsalan, D., & Christen, P. (2014). Scalable privacy-preserving record linkage for multiple databases. In ACM CIKM. Shanghai.Google Scholar
  91. Vatsalan, D., Christen, P., O’Keefe, C. M., & Verykios, V. S. (2014). An evaluation framework for privacy-preserving record linkage. Journal of Privacy and Confidentiality, 6(1), 3.Google Scholar
  92. Vatsalan, D., Christen, P., & Verykios, V. S. (2011). An efficient two-party protocol for approximate matching in private record linkage. In AusDM, CRPIT (Vol. 121). Ballarat, Australia.Google Scholar
  93. Vatsalan, D., Christen, P., & Verykios, V. S. (2013a). Efficient two-party private blocking based on sorted nearest neighborhood clustering. In ACM CIKM (pp. 1949–1958). San Francisco.Google Scholar
  94. Vatsalan, D., Christen, P., & Verykios, V. S. (2013b). A taxonomy of privacy-preserving record linkage techniques. Information Systems, 38(6), 946–969.CrossRefGoogle Scholar
  95. Verykios, V. S., & Christen, P. (2013). Privacy-preserving record linkage. Wiley Interdisciplinary reviews: Data Mining and Knowledge Discovery, 3(5), 321–332.Google Scholar
  96. Verykios, V. S., Karakasidis, A., & Mitrogiannis, V. K. (2009). Privacy preserving record linkage approaches. International Journal of Data Mining, Modelling and Management, 1(2), 206–221.Google Scholar
  97. Winkler, W. E. (2006). Overview of record linkage and current research directions. Technical Report RR2006/02, US Bureau of the Census, Washington, DC.Google Scholar
  98. Yakout, M., Atallah, M. J., & Elmagarmid, A. K. (2009). Efficient private record linkage. In IEEE ICDE (pp. 1283–1286). Shanghai.Google Scholar
  99. Yan, S., Lee, D., Kan, M. Y., & Giles, C. L. (2007). Adaptive sorted neighborhood methods for efficient record linkage. In ACM/IEEE-CS joint conference on Digital Libraries (pp. 185–194). Vancouver.Google Scholar

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. 1.Research School of Computer ScienceThe Australian National UniversityCanberraAustralia

Personalised recommendations