
A fast and efficient Hamming LSH-based scheme for accurate linkage

  • Regular Paper
  • Knowledge and Information Systems

Abstract

In this paper, we propose an efficient scheme for privacy-preserving record linkage that uses the Hamming locality-sensitive hashing technique as the blocking mechanism and the Bloom filter-based encoding method to anonymize the data sets at hand. We achieve highly accurate results while significantly reducing the computational cost by minimizing the number of distance computations performed. Our scheme provides theoretical guarantees for identifying the similar anonymized record pairs by conducting redundant blocking and by performing a distance computation only if the corresponding anonymized record pair is formulated a specified number of times. A series of experiments illustrates the efficacy of our scheme in identifying the similar record pairs while keeping the running time exceptionally low.
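As a rough illustration of the Bloom filter-based encoding mentioned above, the following Python sketch hashes the bigrams of a string into a bit vector and compares two vectors by Hamming distance. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the secret key, the choice of 15 hash functions, and the double-hashing combination of two HMAC digests are all hypothetical; only the 500-bit field length is taken from the notes.

```python
import hashlib
import hmac


def bigrams(value: str) -> list[str]:
    """Return the pairs of adjacent characters of a string."""
    return [value[i:i + 2] for i in range(len(value) - 1)]


def bloom_encode(value: str, size: int = 500, num_hashes: int = 15,
                 key: bytes = b"shared-secret") -> list[int]:
    """Map each bigram to `num_hashes` positions of a `size`-bit filter.

    Assumption: positions come from two keyed HMAC digests combined by
    double hashing, (h1 + i * h2) mod size. The key is a placeholder.
    """
    bits = [0] * size
    for gram in bigrams(value.lower()):
        data = gram.encode("utf-8")
        h1 = int.from_bytes(hmac.new(key, data, hashlib.sha1).digest(), "big")
        h2 = int.from_bytes(hmac.new(key, data, hashlib.sha256).digest(), "big")
        for i in range(num_hashes):
            bits[(h1 + i * h2) % size] = 1
    return bits


def hamming(a: list[int], b: list[int]) -> int:
    """Number of positions in which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))
```

For example, `hamming(bloom_encode("smith"), bloom_encode("smyth"))` is small relative to the filter length because the two strings share two of their four bigrams, which is what makes a Hamming-distance threshold meaningful on the anonymized records.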

Notes

  1. A bigram is a pair of adjacent characters in a string.

  2. http://tools.ietf.org/html/rfc2104.

  3. LSH-based collision counting has also been studied in [12], but the underlying theoretical foundations and the techniques used therein are completely different from ours.

  4. The cumulative probability for pairs with \(d_H < \vartheta \) is less than 0.0523, because distances smaller than \(\vartheta \) yield a higher success probability.

  5. Collection \({\mathcal {U}}\) is instantiated by a HashBag object, which is contained in the Apache Commons package http://commons.apache.org/, for the Java programming language.

  6. ftp://www.app.sboe.state.nc.us/data/.

  7. http://dblp.uni-trier.de/xml/.

  8. Each record includes four fields; therefore, \(S=4 \times 500\).

  9. For \(\textit{Pt}_1\), we apply an insert, a delete, and an edit operation; thus, \(\vartheta \) for rBf should be set to \(30+30+40=100\) bits, while for \(\textit{Pt}_2\) it should be set to \(30+30+40+30+80=210\) bits due to the two additional operations.

  10. For \(\textit{Pt}_1\), \(\vartheta \) for CLK should be set to \(15+15+25=55\) bits, while for \(\textit{Pt}_2\) to \(15+15+25+15+40=110\) bits.

  11. Since we apply an insert, a delete, an edit, and a transpose operation, \(\vartheta \) should be set to \(30+30+40+80=180\) bits.

  12. Counting the collisions for \(C'\) required storing the IDs of both \(A'\) and \(B'\).
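The collision counting referred to in notes 3, 5, and 12 can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the authors' Java code: `collections.Counter` stands in for the HashBag of note 5, every function and parameter name is hypothetical, and bit positions are sampled with replacement. Each of the `L` blocking groups samples `K` bit positions; a pair is compared only after it has collided in at least `min_collisions` groups.

```python
import random
from collections import Counter, defaultdict


def hamming(a, b):
    """Number of positions in which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def hlsh_link(set_a, set_b, size, K, L, min_collisions, threshold, seed=0):
    """Return (id_a, id_b) pairs whose Hamming distance is <= threshold.

    A distance is computed only for pairs that collide in at least
    `min_collisions` of the L blocking groups.
    """
    rng = random.Random(seed)
    # Each blocking group samples K bit positions uniformly at random.
    groups = [[rng.randrange(size) for _ in range(K)] for _ in range(L)]

    # Index data set A: one hash table per blocking group.
    tables = [defaultdict(list) for _ in range(L)]
    for id_a, vec in set_a.items():
        for table, positions in zip(tables, groups):
            table[tuple(vec[p] for p in positions)].append(id_a)

    matches = []
    for id_b, vec_b in set_b.items():
        collisions = Counter()  # bag of candidate Ids from A
        for table, positions in zip(tables, groups):
            key = tuple(vec_b[p] for p in positions)
            collisions.update(table.get(key, ()))
        for id_a, count in collisions.items():
            if count >= min_collisions and hamming(set_a[id_a], vec_b) <= threshold:
                matches.append((id_a, id_b))
    return matches
```

The sketch counts all collisions for a record before filtering, which is the simplest way to show the idea; an implementation that triggers the distance computation as soon as the count reaches the required number of collisions would return the same pairs.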

References

  1. Aggarwal CC, Yu PS (2000) The IGrid index: reversing the dimensionality curse for similarity indexing in high dimensional space. In: International conference on knowledge discovery and data mining, pp 119–129

  2. Andoni A, Indyk P (2008) Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun ACM 51(1):117–122

  3. Bonomi L, Xiong L, Chen R, Fung BCM (2012) Frequent grams based embedding for privacy preserving record linkage. In: International conference on information and knowledge management, pp 1597–1601

  4. Broder AZ, Charikar M, Frieze A, Mitzenmacher M (1998) Minwise independent permutations. In: Symposium on theory of computing, pp 327–336

  5. Christen P (2012a) A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans Knowl Data Eng 24(9):1537–1555

  6. Christen P (2012b) Data matching—concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer, Data-Centric Systems and Applications

  7. Datar M, Immorlica N, Indyk P, Mirrokni VS (2004) Locality-sensitive hashing scheme based on p-stable distributions. In: Symposium on computational geometry, pp 253–262

  8. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

  9. Durham E (2012) A framework for accurate efficient private record linkage. PhD thesis, Vanderbilt University, USA

  10. Durham E, Kantarcioglu M, Xue Y, Toth C, Kuzu M, Malin B (2014) Composite Bloom filters for secure record linkage. IEEE Trans Knowl Data Eng 26(12):2956–2968

  11. Dwork C (2006) Differential privacy. In: Automata, languages and programming, international colloquium. Springer, Berlin Heidelberg, pp 1–12

  12. Gan J, Feng J, Fang Q, Ng W (2012) Locality-sensitive hashing scheme based on dynamic collision counting. In: International conference on management of data, pp 541–552

  13. Gionis A, Indyk P, Motwani R (1999) Similarity search in high dimensions via hashing. In: International conference on very large databases, pp 518–529

  14. Goodman J, O’Rourke J, Indyk P (2004) Handbook of discrete and computational geometry. CRC, Boca Raton

  15. Hall R, Fienberg SE (2010) Privacy-preserving record linkage. In: International conference on privacy in statistical databases, pp 269–283

  16. Hernandez MA, Stolfo SJ (1998) Real world data is dirty: data cleansing and the merge/purge problem. Data Mining Knowl Discov 2(1):9–37

  17. Inan A, Kantarcioglu M, Bertino E, Scannapieco M (2008) A hybrid approach to private record linkage. In: International conference on data engineering, pp 496–505

  18. Inan A, Kantarcioglu M, Ghinita G, Bertino E (2010) Private record matching using differential privacy. In: International conference on extending database technology, pp 123–134

  19. Karakasidis A, Verykios VS (2012) A sorted neighborhood approach to multidimensional privacy preserving blocking. In: International conference on data mining workshops, pp 937–944

  20. Karapiperis D, Verykios VS (2013) A distributed framework for scaling up LSH-based computations in privacy preserving record linkage. In: Balkan conference in informatics, ACM, pp 102–109

  21. Karapiperis D, Verykios VS (2014) A distributed near-optimal LSH-based framework for privacy-preserving record linkage. Comput Sci Inf Syst 11(2):745–763

  22. Karapiperis D, Verykios VS (2015) An LSH-based blocking approach with a homomorphic matching technique for privacy-preserving record linkage. IEEE Trans Knowl Data Eng 27(4):909–921

  23. Kaufman L, Rousseeuw P (1987) Clustering by means of medoids. In: Statistical data analysis based on the L1 norm, Reports of the Faculty of Mathematics and Informatics, Delft University of Technology. Elsevier, Amsterdam, pp 405–406

  24. Kim H, Lee D (2010) Fast iterative hashed record linkage for large-scale data collections. In: International conference on extending database technology, pp 525–536

  25. Kuzu M, Kantarcioglu M, Durham E, Malin B (2011) A constraint satisfaction cryptanalysis of Bloom filters in private record linkage. In: International conference on privacy enhancing technologies, pp 226–245

  26. Kuzu M, Kantarcioglu M, Inan A, Bertino E, Durham E, Malin B (2013) Efficient privacy-aware record integration. In: International conference on extending database technology, pp 167–178

  27. Motwani R, Raghavan P (1995) Randomized algorithms. Cambridge University Press, Cambridge

  28. Niedermeyer F, Steinmetzer S, Kroll M, Schnell R (2014) Cryptanalysis of basic Bloom filters used for privacy preserving record linkage. J Priv Confid 6(2)

  29. Paillier P (1999) Public-key cryptosystems based on composite degree residuosity classes. In: Eurocrypt, pp 223–238

  30. Pang C, Gu L, Hansen D, Maeder A (2009) Privacy-preserving fuzzy matching using a public reference table. Intell Patient Manag 189:71–89

  31. Rajaraman A, Ullman JD (2010) Mining of massive datasets, chapter: Finding similar items. Cambridge University Press, Cambridge

  32. Scannapieco M, Figotin I, Bertino E, Elmagarmid AK (2007) Privacy preserving schema and data matching. In: International conference on management of data, pp 653–664

  33. Schnell R, Bachteler T, Reiher J (2009) Privacy-preserving record linkage using Bloom filters. BMC Med Inform Decis Making 9(1)

  34. Schnell R, Bachteler T, Reiher J (2011) A novel error-tolerant anonymous linking code. Tech. report WP-GRLC-2011-02, German Record Linkage Center

  35. Vatsalan D, Christen P, Verykios V (2011) An efficient two-party protocol for approximate matching in private record linkage. In: Australasian data mining conference, pp 125–136

  36. Vatsalan D, Christen P, Verykios V (2013a) Efficient two-party private blocking based on sorted nearest neighborhood clustering. In: International conference on information and knowledge management, pp 1949–1958

  37. Vatsalan D, Christen P, Verykios VS (2013b) A taxonomy of privacy-preserving record linkage techniques. Inf Syst 38(6):946–969

  38. Weber R, Schek H, Blott S (1998) A quantitative analysis and performance study for similarity search methods in high dimensional spaces. In: International conference on very large data bases, pp 194–205

  39. Yakout M, Atallah MJ, Elmagarmid AK (2009) Efficient private record linkage. In: International conference on data engineering, pp 1283–1286

Author information

Corresponding author

Correspondence to Dimitrios Karapiperis.

Appendix: Solving for \(L_{{\mathcal {C}}}\) in (9)

We bound the right-hand side of (9) from above, so that the probability that a pair with \(d_H=\vartheta \) exhibits fewer than \({\mathcal {C}}\) collisions is at most \(\delta \):

$$\begin{aligned} \exp \left\{ -\frac{(L_{{\mathcal {C}}} \, p^{K}_{\vartheta }-({\mathcal {C}}-1))^2}{2 \, L_{{\mathcal {C}}} \, p^{K}_{\vartheta }}\right\} \le \delta \Leftrightarrow -\frac{(L_{{\mathcal {C}}} \, p^{K}_{\vartheta }-({\mathcal {C}}-1))^2}{2 \, L_{{\mathcal {C}}} \, p^{K}_{\vartheta }} \le \ln (\delta ). \end{aligned}$$
(11)

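Multiplying both sides by \(-2 \, L_{{\mathcal {C}}} \, p^{K}_{\vartheta }\), which is negative and therefore reverses the inequality, and expanding \((L_{{\mathcal {C}}}\, p^{K}_{\vartheta }-({\mathcal {C}}-1))^2\), we derive the following quadratic inequality in \(L_{{\mathcal {C}}}\):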

$$\begin{aligned} (p^{K}_{\vartheta } \, L_{{\mathcal {C}}})^2 + (2 \, p^{K}_{\vartheta } \, \ln (\delta ) - 2 \, p^{K}_{\vartheta } ({\mathcal {C}}-1) )L_{{\mathcal {C}}}+({\mathcal {C}}-1)^2 \ge 0, \end{aligned}$$
(12)

where we finally solve for \(L_{{\mathcal {C}}}\), keeping only the equality case so that \(L_{{\mathcal {C}}}\) is no larger than the bound requires.
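To make the last step concrete, treating (12) as an equality and applying the quadratic formula gives two positive roots. The closed form below is our own working of that step, not a formula quoted from the paper; we take the larger root on the assumption that the bound in (11) is only meaningful when the expected number of collisions \(L_{{\mathcal {C}}} \, p^{K}_{\vartheta }\) exceeds \({\mathcal {C}}-1\):

$$\begin{aligned} L_{{\mathcal {C}}} = \frac{({\mathcal {C}}-1) - \ln (\delta ) + \sqrt{\ln ^2(\delta ) - 2 \, ({\mathcal {C}}-1) \ln (\delta )}}{p^{K}_{\vartheta }}. \end{aligned}$$

A minimal Python sketch of this computation follows. The function names are hypothetical, and `p_k` stands for the value of \(p^{K}_{\vartheta }\), which is passed in directly rather than derived.

```python
import math


def blocking_groups(p_k: float, collisions: int, delta: float) -> int:
    """Smallest integer L above the larger root of the quadratic in (12).

    p_k        -- value of p_theta^K (probability of a collision in one group)
    collisions -- required number of collisions C
    delta      -- upper bound on the failure probability, 0 < delta < 1
    """
    c = collisions - 1
    log_delta = math.log(delta)  # negative for 0 < delta < 1
    root = (c - log_delta + math.sqrt(log_delta ** 2 - 2 * c * log_delta)) / p_k
    return math.ceil(root)


def tail_bound(p_k: float, collisions: int, num_groups: float) -> float:
    """Left-hand side of (11) for a given number of blocking groups."""
    mean = num_groups * p_k
    return math.exp(-((mean - (collisions - 1)) ** 2) / (2 * mean))
```

For instance, with the illustrative values `p_k = 0.1`, `collisions = 3`, and `delta = 0.1`, `blocking_groups` returns 82; `tail_bound` evaluates to just under 0.1 at 82 groups and just over 0.1 at 81, so 82 is the smallest integer that satisfies the bound.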

About this article

Cite this article

Karapiperis, D., Verykios, V.S. A fast and efficient Hamming LSH-based scheme for accurate linkage. Knowl Inf Syst 49, 861–884 (2016). https://doi.org/10.1007/s10115-016-0919-y

  • DOI: https://doi.org/10.1007/s10115-016-0919-y
