Sorted Nearest Neighborhood Clustering for Efficient Private Blocking

  • Dinusha Vatsalan
  • Peter Christen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7819)

Abstract

Record linkage is an emerging research area, required by various real-world applications, that identifies which records in different data sources refer to the same real-world entities. Privacy concerns and restrictions often prevent the use of traditional record linkage applications across different organizations. Linking records in situations where no private or confidential information can be revealed is known as privacy-preserving record linkage (PPRL). As with traditional record linkage applications, scalability is a main challenge in PPRL. This challenge is generally addressed by employing a blocking technique, which reduces the number of candidate record pairs by removing pairs that likely refer to non-matches without comparing them in detail. This paper presents an efficient private blocking technique based on a sorted neighborhood approach that combines k-anonymous clustering and the use of public reference values. An empirical study conducted on real-world databases shows that this approach is scalable to large databases, and that it can provide effective blocking while preserving k-anonymous characteristics. The proposed approach can be up to two orders of magnitude faster than two state-of-the-art private blocking techniques: k-nearest neighbor clustering and Hamming-based locality sensitive hashing.
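The general idea of combining sorted reference values with k-anonymous clustering can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes that blocking keys are plain strings, that each key is assigned to the largest public reference value sorting at or before it, and that neighbouring groups are merged until every block holds at least k keys. The function name `k_anonymous_blocks` and the sample data are hypothetical.

```python
from bisect import bisect_right


def k_anonymous_blocks(keys, reference_values, k):
    """Group blocking keys under sorted public reference values, then merge
    neighbouring groups until each block holds at least k keys.

    Illustrative assumption only; not the method described in the paper.
    """
    refs = sorted(reference_values)
    # One bucket per reference value. A key goes to the largest reference
    # value that sorts at or before it (or to the first one if none does).
    buckets = [[] for _ in refs]
    for key in keys:
        idx = max(bisect_right(refs, key) - 1, 0)
        buckets[idx].append(key)

    blocks = []
    current_refs, current_keys = [], []
    for ref, bucket in zip(refs, buckets):
        current_refs.append(ref)
        current_keys.extend(bucket)
        if len(current_keys) >= k:
            blocks.append((current_refs, current_keys))
            current_refs, current_keys = [], []

    # A trailing group smaller than k is folded into the previous block.
    # If fewer than k keys exist in total, the single block stays below k.
    if current_refs:
        if blocks:
            prev_refs, prev_keys = blocks[-1]
            blocks[-1] = (prev_refs + current_refs, prev_keys + current_keys)
        else:
            blocks.append((current_refs, current_keys))
    return blocks


if __name__ == "__main__":
    # Hypothetical surname keys and public reference values.
    sample_keys = ["adams", "baker", "brown", "clark", "davis", "evans", "smith"]
    sample_refs = ["a", "c", "e", "m"]
    for block_refs, block_keys in k_anonymous_blocks(sample_keys, sample_refs, k=3):
        print(block_refs, block_keys)
```

In a two-party setting, each database owner would run such a grouping locally against the same public reference list, and only records falling into corresponding blocks would be compared in detail. How the blocks are aligned and exchanged privately is outside the scope of this sketch.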

Keywords

sorted neighborhood, nearest neighbor clustering, locality sensitive hashing, k-anonymity, reference values, scalability

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Dinusha Vatsalan (1)
  • Peter Christen (1)
  1. Research School of Computer Science, College of Engineering and Computer Science, The Australian National University, Canberra, Australia
