Abstract
In this paper, we propose an efficient scheme for privacy-preserving record linkage that uses the Hamming locality-sensitive hashing technique as the blocking mechanism and the Bloom filter-based encoding method for anonymizing the data sets at hand. We achieve highly accurate results and simultaneously reduce the computational cost significantly by minimizing the number of distance computations performed. Our scheme provides theoretical guarantees for identifying the similar anonymized record pairs by conducting redundant blocking and by performing a distance computation only if the corresponding anonymized record pair is formed a specified number of times. A series of experiments illustrates the efficacy of our scheme in identifying the similar record pairs while keeping the running time exceptionally low.
Notes
A bigram is a pair of adjacent characters in a string.
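As an illustration of how bigrams feed the Bloom filter-based encoding mentioned in the abstract, the following is a minimal sketch rather than the encoding used in the experiments. The salted SHA-256 hashing, the choice of 15 hash functions per bigram, and all function names are assumptions made for this example; the 500-bit default mirrors the per-field size implied by \(S=4 \times 500\).

```python
import hashlib


def bigrams(s):
    """Return the pairs of adjacent characters of s, in order."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def bloom_encode(value, size=500, num_hashes=15):
    """Encode a string as a bit vector by hashing each of its bigrams.

    Assumption: hash function i is SHA-256 applied to the bigram
    prefixed with the salt "i:"; the hash family is illustrative only.
    """
    bits = [0] * size
    for gram in bigrams(value):
        for i in range(num_hashes):
            digest = hashlib.sha256(f"{i}:{gram}".encode("utf-8")).hexdigest()
            bits[int(digest, 16) % size] = 1
    return bits


def hamming(a, b):
    """Number of positions in which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


# "jones" and "jonas" share the bigrams "jo" and "on", so their encodings
# can differ only in positions set by "ne", "es", "na", and "as".
print(bigrams("jones"))
print(hamming(bloom_encode("jones"), bloom_encode("jonas")))
```

Because each non-shared bigram sets at most `num_hashes` bits, a single-character edit changes a bounded number of bit positions, which is why the Hamming threshold \(\vartheta \) can be chosen per type of perturbation.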
LSH-based collision counting has also been studied in [12], but the underlying theoretical foundations and the techniques used therein are completely different from ours.
For pairs with \(d_H < \vartheta \), the cumulative probability is less than 0.0523, because distances smaller than \(\vartheta \) yield a higher success probability.
Collection \({\mathcal {U}}\) is instantiated as a HashBag object, which is provided by the Apache Commons package (http://commons.apache.org/) for the Java programming language.
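The role of such a bag can be sketched as follows, using Python's `collections.Counter` as an analogue of a HashBag. This is a simplified illustration under stated assumptions, not the implementation evaluated in the paper: the function names, the dictionary-based inputs, and the choice to compare pairs only after all collisions have been counted are hypothetical.

```python
import random
from collections import Counter, defaultdict


def hamming(a, b):
    """Number of positions in which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def build_samplers(size, K, L, seed=0):
    """Draw L blocking groups, each sampling K distinct bit positions."""
    rng = random.Random(seed)
    return [rng.sample(range(size), K) for _ in range(L)]


def block_key(bits, positions):
    """Hamming LSH key: the record's bits at the sampled positions."""
    return tuple(bits[p] for p in positions)


def link(A, B, samplers, C, theta):
    """Compare a pair only if it collides in at least C blocking groups."""
    tables = [defaultdict(list) for _ in samplers]
    for a_id, a_bits in A.items():
        for table, positions in zip(tables, samplers):
            table[block_key(a_bits, positions)].append(a_id)

    collisions = Counter()  # the bag: one count per (a_id, b_id) pair
    for b_id, b_bits in B.items():
        for table, positions in zip(tables, samplers):
            for a_id in table.get(block_key(b_bits, positions), ()):
                collisions[(a_id, b_id)] += 1

    matches = [
        pair for pair, count in collisions.items()
        if count >= C and hamming(A[pair[0]], B[pair[1]]) <= theta
    ]
    return matches, collisions


A = {"a1": [0] * 20}
B = {"b1": [0] * 20, "b2": [1] * 20}
matches, collisions = link(A, B, build_samplers(20, K=5, L=10), C=3, theta=2)
print(matches, dict(collisions))
```

Pairs that collide fewer than `C` times never reach the distance computation, which is the source of the savings in running time.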
Each record includes four fields; therefore, \(S=4 \times 500\).
For \(\textit{Pt}_1\), we apply an insert, a delete, and an edit operation, thus \(\vartheta \) for rBf should be set to \(30+30+40=100\) bits, while for \(\textit{Pt}_2\) to \(30+30+40+30+80=210\) bits due to the two additional operations.
For \(\textit{Pt}_1\), \(\vartheta \) for CLK should be set to \(15+15+25=55\) bits, while for \(\textit{Pt}_2\) to \(15+15+25+15+40=110\) bits.
Since we apply an insert, a delete, an edit, and a transpose operation, \(\vartheta \) should be set to \(30+30+40+80=180\) bits.
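The threshold arithmetic in the three preceding notes can be checked with a few lines of Python. The bit costs are copied from the sums as written; the variable names are hypothetical, and reading the last sum as referring to rBf is an assumption based on its matching the rBf costs.

```python
# Bit costs per perturbation operation, in the order they appear in the notes.
RBF_PT1 = [30, 30, 40]          # insert, delete, edit
RBF_PT2 = RBF_PT1 + [30, 80]    # two additional operations
CLK_PT1 = [15, 15, 25]
CLK_PT2 = CLK_PT1 + [15, 40]
RBF_FOUR_OPS = [30, 30, 40, 80]  # insert, delete, edit, transpose


def threshold(costs):
    """Hamming threshold: the sum of the bit costs of the applied operations."""
    return sum(costs)


S = 4 * 500  # four fields of 500 bits each

for name, costs in [("rBf Pt1", RBF_PT1), ("rBf Pt2", RBF_PT2),
                    ("CLK Pt1", CLK_PT1), ("CLK Pt2", CLK_PT2),
                    ("four operations", RBF_FOUR_OPS)]:
    print(name, threshold(costs))
```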
Counting the collisions for \(C'\) required storing the Ids of both \(A'\) and \(B'\).
References
Aggarwal CC, Yu PS (2000) The IGrid index: reversing the dimensionality curse for similarity indexing in high dimensional space. In: International conference on knowledge discovery and data mining, pp 119–129
Andoni A, Indyk P (2008) Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun ACM 51(1):117–122
Bonomi L, Xiong L, Chen R, Fung BCM (2012) Frequent grams based embedding for privacy preserving record linkage. In: International conference on information and knowledge management, pp 1597–1601
Broder AZ, Charikar M, Frieze A, Mitzenmacher M (1998) Minwise independent permutations. In: Symposium on theory of computing, pp 327–336
Christen P (2012a) A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans Knowl Data Eng 24(9):1537–1555
Christen P (2012b) Data matching—concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer, Data-Centric Systems and Applications
Datar M, Immorlica N, Indyk P, Mirrokni VS (2004) Locality-sensitive hashing scheme based on p-stable distributions. In: Symposium on computational geometry, pp 253–262
Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
Durham E (2012) A framework for accurate efficient private record linkage. PhD thesis, Vanderbilt University, USA
Durham E, Kantarcioglu M, Xue Y, Toth C, Kuzu M, Malin B (2014) Composite Bloom filters for secure record linkage. IEEE Trans Knowl Data Eng 26(12):2956–2968
Dwork C (2006) Differential privacy. In: Automata, languages and programming, international colloquium. Springer, Berlin Heidelberg, pp 1–12
Gan J, Feng J, Fang Q, Ng W (2012) Locality-sensitive hashing scheme based on dynamic collision counting. In: International conference on management of data, pp 541–552
Gionis A, Indyk P, Motwani R (1999) Similarity search in high dimensions via hashing. In: International conference on very large databases, pp 518–529
Goodman J, O’Rourke J, Indyk P (2004) Handbook of discrete and computational geometry. CRC, Boca Raton
Hall R, Fienberg SE (2010) Privacy-preserving record linkage. In: International conference on privacy in statistical databases, pp 269–283
Hernandez MA, Stolfo SJ (1998) Real world data is dirty: data cleansing and the merge/purge problem. Data Mining Knowl Discov 2(1):9–37
Inan A, Kantarcioglu M, Bertino E, Scannapieco M (2008) A hybrid approach to private record linkage. In: International conference on data engineering, pp 496–505
Inan A, Kantarcioglu M, Ghinita G, Bertino E (2010) Private record matching using differential privacy. In: International conference on extending database technology, pp 123–134
Karakasidis A, Verykios VS (2012) A sorted neighborhood approach to multidimensional privacy preserving blocking. In: International conference on data mining workshops, pp 937–944
Karapiperis D, Verykios VS (2013) A distributed framework for scaling up LSH-based computations in privacy preserving record linkage. In: Balkan conference in informatics, ACM, pp 102–109
Karapiperis D, Verykios VS (2014) A distributed near-optimal LSH-based framework for privacy-preserving record linkage. Comput Sci Inf Syst 11(2):745–763
Karapiperis D, Verykios VS (2015) An LSH-based blocking approach with a homomorphic matching technique for privacy-preserving record linkage. IEEE Trans Knowl Data Eng 27(4):909–921
Kaufman L, Rousseeuw P (1987) Clustering by means of medoids. In: Statistical Data Analysis Based on the L1-Norm, Reports of the Faculty of Mathematics and Informatics. Delft University of Technology. Elsevier, Amsterdam, pp 405–406
Kim H, Lee D (2010) Fast iterative hashed record linkage for large-scale data collections. In: International conference on extending database technology, pp 525–536
Kuzu M, Kantarcioglu M, Durham E, Malin B (2011) A constraint satisfaction cryptanalysis of Bloom filters in private record linkage. In: International conference on privacy enhancing technologies, pp 226–245
Kuzu M, Kantarcioglu M, Inan A, Bertino E, Durham E, Malin B (2013) Efficient privacy-aware record integration. In: International conference on extending database technology, pp 167–178
Motwani R, Raghavan P (1995) Randomized algorithms. Cambridge University Press, Cambridge
Niedermeyer F, Steinmetzer S, Kroll M, Schnell R (2014) Cryptanalysis of basic Bloom filters used for privacy preserving record linkage. J Priv Confid 6(2)
Paillier P (1999) Public-key cryptosystems based on composite degree residuosity classes. In: Eurocrypt, pp 223–238
Pang C, Gu L, Hansen D, Maeder A (2009) Privacy-preserving fuzzy matching using a public reference table. Intell Patient Manag 189:71–89
Rajaraman A, Ullman JD (2010) Mining of massive datasets, chapter Finding similar items. Cambridge University Press, Cambridge
Scannapieco M, Figotin I, Bertino E, Elmagarmid AK (2007) Privacy preserving schema and data matching. In: International conference on management of data, pp 653–664
Schnell R, Bachteler T, Reiher J (2009) Privacy-preserving record linkage using Bloom filters. BMC Med Inform Decis Making 9(1)
Schnell R, Bachteler T, Reiher J (2011) A novel error-tolerant anonymous linking code. Tech. report WP-GRLC-2011-02, German Record Linkage Center
Vatsalan D, Christen P, Verykios V (2011) An efficient two-party protocol for approximate matching in private record linkage. In: Australasian data mining conference, pp 125–136
Vatsalan D, Christen P, Verykios V (2013a) Efficient two-party private blocking based on sorted nearest neighborhood clustering. In: International conference on information and knowledge management, pp 1949–1958
Vatsalan D, Christen P, Verykios VS (2013b) A taxonomy of privacy-preserving record linkage techniques. Inf Syst 38(6):946–969
Weber R, Schek H, Blott S (1998) A quantitative analysis and performance study for similarity search methods in high dimensional spaces. In: International conference on very large data bases, pp 194–205
Yakout M, Atallah MJ, Elmagarmid AK (2009) Efficient private record linkage. In: International conference on data engineering, pp 1283–1286
Appendix: Solving for \(L_{{\mathcal {C}}}\) in (9)
We upper-bound the right-hand side of (9) in order to guarantee that the probability for a pair with \(d_H=\vartheta \) to exhibit fewer collisions than \({\mathcal {C}}\) is bounded by \(\delta \) as follows:
We expand \((L_{{\mathcal {C}}}\, p^{K}_{\vartheta }-({\mathcal {C}}-1))^2\) and derive the following quadratic equation:
where we finally solve for \(L_{{\mathcal {C}}}\) and keep only the equality, since we want \(L_{{\mathcal {C}}}\) to be as small as possible.
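The displayed equations of this appendix are not reproduced here, so the following is only a sketch under an explicit assumption: that (9) has the form of the standard Chernoff lower-tail bound, \(\Pr [X \le {\mathcal {C}}-1] \le \exp \left( -(L_{{\mathcal {C}}}\, p^{K}_{\vartheta }-({\mathcal {C}}-1))^2 / (2 L_{{\mathcal {C}}}\, p^{K}_{\vartheta })\right) \), where \(X\) is the number of collisions of a pair with \(d_H=\vartheta \). Requiring this bound to equal \(\delta \) and expanding the square yields, with \(\mu = L_{{\mathcal {C}}}\, p^{K}_{\vartheta }\), the quadratic \(\mu ^2 - 2\mu \left( ({\mathcal {C}}-1)+\ln (1/\delta )\right) + ({\mathcal {C}}-1)^2 = 0\), whose larger root is the admissible one because the bound holds only for \(\mu > {\mathcal {C}}-1\). The base probability \(p_{\vartheta } = 1-\vartheta /S\) is the usual Hamming LSH agreement probability and is likewise assumed; the function names are hypothetical.

```python
import math


def solve_L(theta, S, K, C, delta):
    """Smallest integer L for which the assumed Chernoff bound is <= delta.

    Assumptions: p_theta = 1 - theta / S, and the bound
    exp(-(mu - (C - 1))**2 / (2 * mu)) with mu = L * p_theta**K.
    """
    pK = (1.0 - theta / S) ** K
    c = C - 1
    t = math.log(1.0 / delta)
    # Larger root of mu^2 - 2*mu*(c + t) + c^2 = 0.
    mu = (c + t) + math.sqrt(t * t + 2.0 * c * t)
    return math.ceil(mu / pK)


def chernoff_bound(L, pK, C):
    """Assumed upper bound on the probability of fewer than C collisions."""
    mu = L * pK
    return math.exp(-((mu - (C - 1)) ** 2) / (2.0 * mu))


theta, S, K, C, delta = 100, 2000, 30, 5, 0.1
L = solve_L(theta, S, K, C, delta)
print(L, chernoff_bound(L, (1.0 - theta / S) ** K, C))
```

Rounding \(L_{{\mathcal {C}}}\) up to an integer can only tighten the guarantee, since the assumed bound decreases as \(\mu \) grows beyond \({\mathcal {C}}-1\).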
Cite this article
Karapiperis, D., Verykios, V.S. A fast and efficient Hamming LSH-based scheme for accurate linkage. Knowl Inf Syst 49, 861–884 (2016). https://doi.org/10.1007/s10115-016-0919-y
DOI: https://doi.org/10.1007/s10115-016-0919-y