
Privacy-Preserving Record Linkage for Big Data: Current Approaches and Research Challenges

  • Dinusha Vatsalan
  • Ziad Sehili
  • Peter Christen
  • Erhard Rahm

Abstract

The growth of Big Data, especially personal data dispersed across multiple data sources, presents enormous opportunities for businesses to explore and leverage the value of linked and integrated data. However, privacy concerns impede sharing or exchanging data for linkage across different organizations. Privacy-preserving record linkage (PPRL) aims to address this problem by identifying and linking records that correspond to the same real-world entity across several data sources held by different parties without revealing any sensitive information about these entities. PPRL is increasingly required in many real-world application areas, ranging from public health surveillance to crime and fraud detection and national security. PPRL for Big Data poses several challenges, the three major ones being (1) scalability to multiple large databases, due to their massive volume and the flow of data within Big Data applications, (2) achieving high-quality linkage results in the presence of the variety and veracity of Big Data, and (3) preserving the privacy and confidentiality of the entities represented in Big Data collections. In this chapter, we describe the challenges of PPRL in the context of Big Data, survey existing techniques for PPRL, and provide directions for future research.

Keywords

Record linkage, Privacy, Big data, Scalability

Notes

Acknowledgements

This work was partially funded by the Australian Research Council under Discovery Project DP130101801, the German Academic Exchange Service (DAAD) and Universities Australia (UA) under the Joint Research Co-operation Scheme, and also funded by the German Federal Ministry of Education and Research within the project Competence Center for Scalable Data Services and Solutions (ScaDS) Dresden/Leipzig (BMBF 01IS14014B).

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Dinusha Vatsalan (1)
  • Ziad Sehili (2)
  • Peter Christen (1)
  • Erhard Rahm (2)

  1. Research School of Computer Science, The Australian National University, Acton, Australia
  2. Database Group, University of Leipzig, Leipzig, Germany
