A fast and efficient Hamming LSH-based scheme for accurate linkage

Karapiperis, Dimitrios; Verykios, Vassilios S.

doi:10.1007/s10115-016-0919-y

Regular Paper
Published: 03 February 2016

Volume 49, pages 861–884, (2016)
Knowledge and Information Systems

Abstract

In this paper, we propose an efficient scheme for privacy-preserving record linkage that uses the Hamming locality-sensitive hashing technique as the blocking mechanism and the Bloom filter-based encoding method for anonymizing the data sets at hand. We achieve highly accurate results while significantly reducing the computational cost by minimizing the number of distance computations performed. Our scheme provides theoretical guarantees for identifying the similar anonymized record pairs by conducting redundant blocking and by performing a distance computation only if the corresponding anonymized record pair has been formulated a specified number of times. A series of experiments illustrates the efficacy of our scheme in identifying the similar record pairs while keeping the running time exceptionally low.
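The blocking idea summarized above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the parameters `K` (sampled bit positions per hash table), `L` (number of hash tables), `C` (required number of collisions) and `theta` (Hamming distance threshold), and the choice to compare pairs only after all tables have been built are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def hamming(a, b):
    """Hamming distance between two equal-length bit sequences."""
    return sum(x != y for x, y in zip(a, b))


def make_hash_functions(size, K, L, seed=0):
    """Draw L composite hash functions; each samples K bit positions."""
    rng = random.Random(seed)
    return [[rng.randrange(size) for _ in range(K)] for _ in range(L)]


def link(A, B, K, L, C, theta, seed=0):
    """Return pairs (id_a, id_b) that collide at least C times across the
    L hash tables and whose Hamming distance is at most theta.

    A and B map record ids to equal-length bit sequences (hypothetical
    stand-ins for the anonymized records of the two data sets).
    """
    size = len(next(iter(A.values())))
    collisions = Counter()  # counts how often each pair shares a bucket
    for positions in make_hash_functions(size, K, L, seed):
        buckets = defaultdict(list)
        for id_a, bits in A.items():
            buckets[tuple(bits[i] for i in positions)].append(id_a)
        for id_b, bits in B.items():
            key = tuple(bits[i] for i in positions)
            for id_a in buckets.get(key, ()):
                collisions[(id_a, id_b)] += 1
    # The distance is computed only for pairs formulated at least C times.
    return [
        pair
        for pair, count in collisions.items()
        if count >= C and hamming(A[pair[0]], B[pair[1]]) <= theta
    ]
```

Pairs that collide fewer than `C` times never reach the distance computation, which is where the saving in running time would come from in this sketch.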

Notes

A bigram is a pair of adjacent characters in a string.
http://tools.ietf.org/html/rfc2104.
LSH-based collision counting has also been studied in [12], but the underlying theoretical foundations and the techniques used therein are completely different from ours.
For pairs with $d_H < \vartheta $, the cumulative probability is less than 0.0523, because distances smaller than $\vartheta $ yield a higher success probability.
Collection ${\mathcal {U}}$ is instantiated as a HashBag object from the Apache Commons package (http://commons.apache.org/) for the Java programming language.
ftp://www.app.sboe.state.nc.us/data/.
http://dblp.uni-trier.de/xml/.
Each record includes four fields; therefore, $S=4 \times 500$.
For $\textit{Pt}_1$, we apply an insert, a delete, and an edit operation, thus $\vartheta $ for rBf should be set to $30+30+40=100$ bits, while for $\textit{Pt}_2$ to $30+30+40+30+80=210$ bits due to the two additional operations.
For $\textit{Pt}_1$, $\vartheta $ for CLK should be set to $15+15+25=55$ bits, while for $\textit{Pt}_2$ to $15+15+25+15+40=110$ bits.
Since we apply an insert, a delete, an edit, and a transpose operation, $\vartheta $ should be set to $30+30+40+80=180$ bits.
Counting the collisions for $C'$ required storing the IDs of both $A'$ and $B'$.
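To make the notes on bigrams, keyed hashing, and bit-level thresholds more concrete, the following sketch encodes the bigrams of a single field value into a 500-bit Bloom filter using HMAC (RFC 2104). It is only an illustration under stated assumptions: the double-hashing construction, the number of hash functions (`num_hashes=15`), the choice of SHA-1 and MD5, and the keys are hypothetical and are not taken from the text above; only the 500-bit field size follows from $S=4 \times 500$.

```python
import hashlib
import hmac


def bigrams(text):
    """Return the set of pairs of adjacent characters in `text`."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def bloom_encode(text, size=500, num_hashes=15,
                 key1=b"hypothetical-key-1", key2=b"hypothetical-key-2"):
    """Hash each bigram of `text` into a `size`-bit Bloom filter.

    Assumption: positions are derived by double hashing with two keyed
    HMAC digests; the scheme described in the article may differ.
    """
    bits = [0] * size
    for gram in bigrams(text):
        data = gram.encode()
        h1 = int.from_bytes(hmac.new(key1, data, hashlib.sha1).digest(), "big")
        h2 = int.from_bytes(hmac.new(key2, data, hashlib.md5).digest(), "big")
        for i in range(num_hashes):
            bits[(h1 + i * h2) % size] = 1
    return bits


def hamming(a, b):
    """Hamming distance between two equal-length bit sequences."""
    return sum(x != y for x, y in zip(a, b))
```

Under these assumptions, replacing one character changes at most two bigrams on each side, so the two filters can differ in at most `4 * num_hashes` positions; thresholds such as those in the notes above bound the number of differing bits per operation in the same spirit.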

References

Aggarwal CC, Yu PS (2000) The igrid index: reversing the dimensionality curse for similarity indexing in high dimensional space. In: International conference on knowledge discovery and data mining, pp 119–129
Andoni A, Indyk P (2008) Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun ACM 51(1):117–122
Bonomi L, Xiong L, Chen R, Fung BCM (2012) Frequent grams based embedding for privacy preserving record linkage. In: International conference on information and knowledge management, pp 1597–1601
Broder AZ, Charikar M, Frieze A, Mitzenmacher M (1998) Minwise independent permutations. In: Symposium on theory of computing, pp 327–336
Christen P (2012a) A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans Knowl Data Eng 12(9):1537–1555
Christen P (2012b) Data matching—concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer, Data-Centric Systems and Applications
Datar M, Immorlica N, Indyk P, Mirrokni VS (2004) Locality-sensitive hashing scheme based on p-stable distributions. In: Symposium on computational geometry, pp 253–262
Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
Durham E (2012) A framework for accurate efficient private record linkage. PhD thesis, Vanderbilt University, USA
Durham E, Kantarcioglu M, Xue Y, Toth C, Kuzu M, Malin B (2014) Composite Bloom filters for secure record linkage. IEEE Trans Knowl Data Eng 26(12):2956–2968
Dwork C (2006) Differential privacy. In: Automata, languages and programming, international colloquium. Springer, Berlin Heidelberg, pp 1–12
Gan J, Feng J, Fang Q, Ng W (2012) Locality-sensitive hashing scheme based on dynamic collision counting. In: International conference on management of data, pp 541–552
Gionis A, Indyk P, Motwani R (1999) Similarity search in high dimensions via hashing. In: International conference on very large databases, pp 518–529
Goodman J, O’Rourke J, Indyk P (2004) Handbook of discrete and computational geometry. CRC, Boca Raton
Hall R, Fienberg SE (2010) Privacy-preserving record linkage. In: International conference on privacy in statistical databases, pp 269–283
Hernandez MA, Stolfo SJ (1998) Real world data is dirty: data cleansing and the merge/purge problem. Data Mining Knowl Discov 2(1):9–37
Inan A, Kantarcioglu M, Bertino E, Scannapieco M (2008) A hybrid approach to private record linkage. In: International conference on data engineering, pp 496–505
Inan A, Kantarcioglu M, Ghinita G, Bertino E (2010) Private record matching using differential privacy. In: International conference on extending database technology, pp 123–134
Karakasidis A, Verykios VS (2012) A sorted neighborhood approach to multidimensional privacy preserving blocking. In: International conference on data mining workshops, pp 937–944
Karapiperis D, Verykios VS (2013) A distributed framework for scaling up LSH-based computations in privacy preserving record linkage. In: Balkan conference in informatics, ACM, pp 102–109
Karapiperis D, Verykios VS (2014) A distributed near-optimal LSH-based framework for privacy-preserving record linkage. Comput Sci Inf Syst 11(2):745–763
Karapiperis D, Verykios VS (2015) An LSH-based blocking approach with a homomorphic matching technique for privacy-preserving record linkage. IEEE Trans Knowl Data Eng 27(4):909–921
Kaufman L, Rousseeuw P (1987) Clustering by means of medoids. In: Statistical Data Analysis Based on the L1Norm, Reports of the Faculty of Mathematics and Informatics. Delft University of Technology. Elsevier, Amsterdam, pp 405–406
Kim H, Lee D (2010) Fast iterative hashed record linkage for large-scale data collections. In: International conference on extending database technology, pp 525–536
Kuzu M, Kantarcioglu M, Durham E, Malin B (2011) A constraint satisfaction cryptanalysis of Bloom filters in private record linkage. In: International conference on privacy enhancing technologies, pp 226–245
Kuzu M, Kantarcioglu M, Inan A, Bertino E, Durham E, Malin B (2013) Efficient privacy-aware record integration. In: International conference on extending database technology, pp 167–178
Motwani R, Raghavan P (1995) Randomized algorithms. Cambridge University Press, Cambridge
Niedermeyer F, Steinmetzer S, Kroll Martin M, Schnell R (2014) Cryptanalysis of basic Bloom filters used for privacy preserving record linkage. J Priv Confid 6(2)
Paillier P (1999) Public-key cryptosystems based on composite degree residuosity classes. In: Eurocrypt, pp 223–238
Pang C, Gu L, Hansen D, Maeder A (2009) Privacy-preserving fuzzy matching using a public reference table. Intell Patient Manag 189:71–89
Rajaraman A, Ullman JD (2010) Mining of massive datasets, chapter "Finding similar items". Cambridge University Press, Cambridge
Scannapieco M, Figotin I, Bertino E, Elmagarmid AK (2007) Privacy preserving schema and data matching. In: International conference on management of data, pp 653–664
Schnell R, Bachteler T, Reiher J (2009) Privacy-preserving record linkage using Bloom filters. BMC Med Inform Decis Making 9(1)
Schnell R, Bachteler T, Reiher J (2011) A novel error-tolerant anonymous linking code. Tech. report WP-GRLC-2011-02, German Record Linkage Center
Vatsalan D, Christen P, Verykios V (2011) An efficient two-party protocol for approximate matching in private record linkage. In: Australasian data mining conference, pp 125–136
Vatsalan D, Christen P, Verykios V (2013a) Efficient two-party private blocking based on sorted nearest neighborhood clustering. In: International conference on information and knowledge management, pp 1949–1958
Vatsalan D, Christen P, Verykios VS (2013b) A taxonomy of privacy-preserving record linkage techniques. Inf Syst 38(6):946–969
Weber R, Schek H, Blott S (1998) A quantitative analysis and performance study for similarity search methods in high dimensional spaces. In: International conference on very large data bases, pp 194–205
Yakout M, Atallah MJ, Elmagarmid AK (2009) Efficient private record linkage. In: International conference on data engineering, pp 1283–1286

Author information

Authors and Affiliations

School of Science and Technology, Hellenic Open University, Patras, Greece
Dimitrios Karapiperis & Vassilios S. Verykios

Corresponding author

Correspondence to Dimitrios Karapiperis.

Appendix: Solving for $L_{{\mathcal {C}}}$ in (9)

We bound the right-hand side of (9) from above in order to guarantee that the probability for a pair with $d_H=\vartheta $ to exhibit fewer collisions than ${\mathcal {C}}$ is bounded by $\delta $, as follows:

$$\begin{aligned} \exp \left\{ -\frac{(L_{{\mathcal {C}}} \, p^{K}_{\vartheta }-({\mathcal {C}}-1))^2}{2 \, L_{{\mathcal {C}}} \, p^{K}_{\vartheta }}\right\} \le \delta \Leftrightarrow -\frac{(L_{{\mathcal {C}}} \, p^{K}_{\vartheta }-({\mathcal {C}}-1))^2}{2 \, L_{{\mathcal {C}}} \, p^{K}_{\vartheta }} \le \ln (\delta ). \end{aligned}$$

(11)

We expand $(L_{{\mathcal {C}}}\, p^{K}_{\vartheta }-({\mathcal {C}}-1))^2$ and derive the following quadratic inequality in $L_{{\mathcal {C}}}$:

$$\begin{aligned} (p^{K}_{\vartheta } \, L_{{\mathcal {C}}})^2 + (2 \, p^{K}_{\vartheta } \, \ln (\delta ) - 2 \, p^{K}_{\vartheta } ({\mathcal {C}}-1) )L_{{\mathcal {C}}}+({\mathcal {C}}-1)^2 \ge 0, \end{aligned}$$

(12)

Finally, we solve (12) for $L_{{\mathcal {C}}}$ and keep only the equality, since we want $L_{{\mathcal {C}}}$ to be as close to optimal as possible.
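As a rough numerical illustration of this last step, write $x = L_{{\mathcal {C}}} \, p^{K}_{\vartheta }$, $m = {\mathcal {C}}-1$ and $t = \ln (1/\delta )$, so that the equality in (12) reads $x^2 - 2(m+t)x + m^2 = 0$. The sketch below takes the larger root, $x = m + t + \sqrt{t^2 + 2tm}$, on the assumption that the bound in (11) is only meaningful when the expected number of collisions exceeds ${\mathcal {C}}-1$. That choice of root, the rounding up to an integer, and the function and parameter names are assumptions made for the example rather than statements from the derivation above.

```python
import math


def min_hash_tables(p_k, C, delta):
    """Smallest integer L meeting the equality form of (12), larger root.

    p_k   -- the value of p_theta^K (collision probability of one table)
    C     -- required number of collisions
    delta -- upper bound on the failure probability
    """
    if not (0.0 < p_k <= 1.0) or not (0.0 < delta < 1.0) or C < 1:
        raise ValueError("expected 0 < p_k <= 1, 0 < delta < 1 and C >= 1")
    m = C - 1
    t = math.log(1.0 / delta)
    x = m + t + math.sqrt(t * t + 2.0 * t * m)  # larger root in x = L * p_k
    return math.ceil(x / p_k)


def tail_bound(L, p_k, C):
    """Left-hand side of (11) for a given number of hash tables L."""
    mean = L * p_k
    return math.exp(-((mean - (C - 1)) ** 2) / (2.0 * mean))
```

For instance, `tail_bound(min_hash_tables(p_k, C, delta), p_k, C)` can be compared against `delta` to check that the returned number of hash tables satisfies (11).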

About this article

Cite this article

Karapiperis, D., Verykios, V.S. A fast and efficient Hamming LSH-based scheme for accurate linkage. Knowl Inf Syst 49, 861–884 (2016). https://doi.org/10.1007/s10115-016-0919-y

Received: 13 October 2014
Revised: 03 December 2015
Accepted: 20 January 2016
Published: 03 February 2016
Issue Date: December 2016
DOI: https://doi.org/10.1007/s10115-016-0919-y
