Advanced Record Linkage Methods and Privacy Aspects for Population Reconstruction—A Survey and Case Studies

Population Reconstruction

Abstract

Recent years have seen increased interest in techniques for linking records across databases. The main challenges of record linkage are (1) scalability to the increasingly large databases common today; (2) accurate and efficient classification of compared record pairs into matches and non-matches in the presence of variations and errors in the data; and (3) privacy issues that arise when records are linked using sensitive personal information about individuals. The first challenge has been addressed by the development of scalable indexing techniques, the second by advanced classification techniques that employ machine learning or graph-based methods, and the third by research into privacy-preserving record linkage (PPRL). In this chapter, we describe these major challenges of record linkage in the context of population reconstruction. We survey recent developments in advanced record linkage methods, discuss two real-world case studies, and provide directions for future research.
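To make the first two challenges concrete, the following is a minimal sketch, not the chapter's method. It assumes a simple blocking key as the indexing step (first letter of the surname plus birth year) and a fixed threshold on the average bigram Dice similarity as the classification step. The record fields, the function names, the sample data, and the threshold of 0.7 are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def bigrams(value):
    """Return the set of character bigrams of a lower-cased string."""
    s = value.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a, b):
    """Dice coefficient of the bigram sets of two strings, in [0, 1]."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def blocking_key(record):
    """Hypothetical blocking key: surname initial plus birth year."""
    return record["surname"][:1].lower() + str(record["year"])


def link(db_a, db_b, threshold=0.7):
    """Compare only records that share a blocking key, then classify
    each compared pair as a match if its average name similarity
    reaches the (assumed) threshold."""
    blocks = defaultdict(list)
    for rec in db_b:
        blocks[blocking_key(rec)].append(rec)

    matches = []
    for rec_a in db_a:
        for rec_b in blocks.get(blocking_key(rec_a), []):
            sim = (dice(rec_a["first"], rec_b["first"])
                   + dice(rec_a["surname"], rec_b["surname"])) / 2
            if sim >= threshold:
                matches.append((rec_a["id"], rec_b["id"]))
    return matches


# Made-up sample records, for illustration only.
db_a = [
    {"id": "a1", "first": "Catherine", "surname": "Smith", "year": 1851},
    {"id": "a2", "first": "John", "surname": "Miller", "year": 1849},
]
db_b = [
    {"id": "b1", "first": "Katherine", "surname": "Smith", "year": 1851},
    {"id": "b2", "first": "Joan", "surname": "Miles", "year": 1849},
    {"id": "b3", "first": "Catherine", "surname": "Smith", "year": 1860},
]

print(link(db_a, db_b))
```

Blocking reduces the number of comparisons from every pair across the two databases to only the pairs within each block, but it also shows the trade-off: record b3 is never compared with a1 because their birth years differ, so an error in the blocking field can hide a true match. A PPRL variant would compute the similarity on encoded values rather than on the plain names.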

Acknowledgments

The authors would like to thank Mac Boot (The Australian National University) and Vassilios S. Verykios (Hellenic Open University) for their contributions to the work presented in this chapter.

Corresponding author

Correspondence to Peter Christen.

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Christen, P., Vatsalan, D., Fu, Z. (2015). Advanced Record Linkage Methods and Privacy Aspects for Population Reconstruction—A Survey and Case Studies. In: Bloothooft, G., Christen, P., Mandemakers, K., Schraagen, M. (eds) Population Reconstruction. Springer, Cham. https://doi.org/10.1007/978-3-319-19884-2_5
