Post-processing Methods for High Quality Privacy-Preserving Record Linkage

  • Martin Franke (email author)
  • Ziad Sehili
  • Marcel Gladbach
  • Erhard Rahm
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11025)

Abstract

Privacy-preserving record linkage (PPRL) supports the integration of person-related data from different sources while protecting the privacy of individuals by encoding the sensitive information needed for linkage. The use of encoded data makes it challenging to achieve high linkage quality, in particular for dirty data containing errors or inconsistencies. Moreover, person-related data is often dense, e.g., due to frequent names or addresses, leading to high similarities for non-matches. Both effects are hard to deal with in common PPRL approaches that rely on a simple threshold-based classification to decide whether a record pair is considered a match. In particular, dirty or dense data are likely to produce many multi-links, where a person is wrongly linked to more than one other person. Therefore, we propose the use of post-processing methods for resolving multi-links and outline three possible approaches. In our evaluation using large synthetic and real datasets, we compare these approaches with each other and show that applying post-processing is highly beneficial and can significantly increase linkage quality in terms of both precision and F-measure.
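
To make the multi-link problem concrete, the following is a minimal sketch, not the method of the paper. It assumes that pairwise similarity scores between two sources A and B are already available; the record identifiers, the scores, and the threshold of 0.8 are hypothetical. The abstract does not name the three post-processing approaches, so the sketch uses a simple greedy best-match selection as an illustrative stand-in for enforcing one-to-one links.

```python
from collections import defaultdict


def threshold_links(similarities, threshold):
    """Threshold-based classification: keep every pair scoring >= threshold."""
    return [(a, b, s) for (a, b), s in similarities.items() if s >= threshold]


def multi_linked(links):
    """Return the records (tagged by source) that occur in more than one link."""
    counts = defaultdict(int)
    for a, b, _ in links:
        counts[("A", a)] += 1
        counts[("B", b)] += 1
    return {record for record, n in counts.items() if n > 1}


def greedy_one_to_one(links):
    """Illustrative post-processing (an assumption, not the paper's method):
    accept links in descending similarity and skip any link whose record
    on either side has already been matched."""
    used_a, used_b, result = set(), set(), []
    for a, b, s in sorted(links, key=lambda l: (-l[2], l[0], l[1])):
        if a not in used_a and b not in used_b:
            used_a.add(a)
            used_b.add(b)
            result.append((a, b, s))
    return result


# Hypothetical similarity scores between records of source A and source B.
similarities = {
    ("a1", "b1"): 0.95,
    ("a1", "b2"): 0.85,
    ("a2", "b2"): 0.90,
    ("a3", "b2"): 0.82,
    ("a3", "b3"): 0.60,
}

links = threshold_links(similarities, threshold=0.8)
resolved = greedy_one_to_one(links)

print("links after thresholding:", links)
print("records with multi-links:", multi_linked(links))
print("links after post-processing:", resolved)
```

In this toy input, thresholding alone links `a1` to two records and `b2` to three; the greedy pass keeps only the highest-scoring link per record. A greedy selection is easy to follow but is not guaranteed to maximise the overall similarity of the final link set.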

Keywords

Record linkage, Post-processing, Privacy, Linkage quality

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Martin Franke (1), email author
  • Ziad Sehili (1)
  • Marcel Gladbach (1)
  • Erhard Rahm (1)
  1. Database Group, University of Leipzig, Leipzig, Germany