Volume 16, Issue 3, pp 227–236

Speeding up Privacy Preserving Record Linkage for Metric Space Similarity Measures

  • Ziad Sehili
  • Erhard Rahm


The analysis of person-related data in Big Data applications faces a tradeoff between finding useful results and preserving a high degree of privacy. This is especially challenging when person-related data from multiple sources need to be integrated and analyzed. Privacy-preserving record linkage (PPRL) addresses this problem by encoding sensitive attribute values such that the identification of persons is prevented but records can still be matched. In this paper we study how to improve the efficiency and scalability of PPRL by restricting the search space for matching encoded records. We focus on similarity measures for metric spaces and investigate the use of M‑trees as well as pivot-based solutions. Our evaluation shows that the new schemes outperform previous filter approaches by an order of magnitude.
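To illustrate how a metric-space measure can restrict the search space, the following is a minimal sketch of pivot-based filtering via the triangle inequality. It is not the authors' implementation: it assumes records are encoded as bit vectors (stored here as Python integers), that Hamming distance is the metric, and that a single pivot is used. The names `hamming`, `pivot_filter_search`, and all sample values are hypothetical.

```python
from typing import List, Tuple


def hamming(a: int, b: int) -> int:
    """Hamming distance between two bit vectors stored as Python ints."""
    return bin(a ^ b).count("1")


def pivot_filter_search(
    query: int,
    records: List[int],
    pivot: int,
    pivot_dists: List[int],
    max_dist: int,
) -> Tuple[List[int], int]:
    """Return (indices of matching records, number of full comparisons).

    pivot_dists[i] must hold the precomputed distance hamming(records[i], pivot).
    By the triangle inequality, |d(q, p) - d(r, p)| <= d(q, r), so any record
    with |d(q, p) - d(r, p)| > max_dist cannot match and is skipped without
    computing d(q, r).
    """
    d_qp = hamming(query, pivot)
    matches: List[int] = []
    compared = 0
    for i, (rec, d_rp) in enumerate(zip(records, pivot_dists)):
        if abs(d_qp - d_rp) > max_dist:
            continue  # pruned by the lower bound on d(query, rec)
        compared += 1
        if hamming(query, rec) <= max_dist:
            matches.append(i)
    return matches, compared


# Hypothetical sample data: four 8-bit encoded records and an all-zero pivot.
records = [0b11110000, 0b11110001, 0b00000001, 0b11111111]
pivot = 0b00000000
pivot_dists = [hamming(r, pivot) for r in records]  # computed once, offline

query = 0b11110000
matches, compared = pivot_filter_search(query, records, pivot, pivot_dists, max_dist=1)

# Reference result without any filtering, for comparison.
brute_force = [i for i, r in enumerate(records) if hamming(query, r) <= 1]
```

In this toy setting the filter returns the same matches as the unfiltered scan while computing the full distance for only two of the four records. How much pruning is achieved in practice depends on the choice and number of pivots and on the distance threshold.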


Keywords: Metric Space, M-Tree, Triangle Inequality, Bloom Filter, Record Linkage



Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  1. Institut für Informatik, Universität Leipzig, Leipzig, Germany
