Privacy-Preserving Record Linkage for Big Data: Current Approaches and Research Challenges

Handbook of Big Data Technologies

Abstract

The growth of Big Data, especially personal data dispersed across multiple data sources, gives businesses enormous opportunities to explore and leverage the value of linked and integrated data. However, privacy concerns impede the sharing or exchange of data for linkage across organizations. Privacy-preserving record linkage (PPRL) aims to address this problem by identifying and linking records that correspond to the same real-world entity across several data sources held by different parties, without revealing any sensitive information about these entities. PPRL is increasingly required in real-world application areas ranging from public health surveillance to crime and fraud detection and national security. PPRL for Big Data poses several challenges, the three major ones being (1) scalability to multiple large databases, given their massive volume and the flow of data within Big Data applications, (2) achieving high-quality linkage results in the presence of the variety and veracity of Big Data, and (3) preserving the privacy and confidentiality of the entities represented in Big Data collections. In this chapter, we describe the challenges of PPRL in the context of Big Data, survey existing techniques for PPRL, and provide directions for future research.

Acknowledgements

This work was partially funded by the Australian Research Council under Discovery Project DP130101801, the German Academic Exchange Service (DAAD) and Universities Australia (UA) under the Joint Research Co-operation Scheme, and also funded by the German Federal Ministry of Education and Research within the project Competence Center for Scalable Data Services and Solutions (ScaDS) Dresden/Leipzig (BMBF 01IS14014B).

Corresponding author

Correspondence to Erhard Rahm.

Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Vatsalan, D., Sehili, Z., Christen, P., Rahm, E. (2017). Privacy-Preserving Record Linkage for Big Data: Current Approaches and Research Challenges. In: Zomaya, A., Sakr, S. (eds) Handbook of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-49340-4_25

  • DOI: https://doi.org/10.1007/978-3-319-49340-4_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-49339-8

  • Online ISBN: 978-3-319-49340-4

  • eBook Packages: Computer Science; Computer Science (R0)
