Skip to main content

A Comparison of Blocking Methods for Record Linkage

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNISA,volume 8744)

Abstract

Record linkage seeks to merge databases and to remove duplicates when unique identifiers are not available. Most approaches use blocking techniques to reduce the computational complexity associated with record linkage. We review traditional blocking techniques, which typically partition the records according to a set of field attributes, and consider two variants of a method known as locality sensitive hashing, sometimes referred to as “private blocking.” We compare these approaches in terms of their recall, reduction ratio, and computational complexity. We evaluate these methods using different synthetic datafiles and conclude with a discussion of privacy-related issues.

Keywords

  • Hash Function
  • Reduction Ratio
  • Record Linkage
  • High Recall
  • Random Projection

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

This is a preview of subscription content, access via your institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • DOI: 10.1007/978-3-319-11257-2_20
  • Chapter length: 16 pages
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
eBook
USD   54.99
Price excludes VAT (USA)
  • ISBN: 978-3-319-11257-2
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
Softcover Book
USD   69.99
Price excludes VAT (USA)

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Christen, P.: Probabilistic Data Generation for Deduplication and Data Linkage. In: Gallagher, M., Hogan, J.P., Maire, F. (eds.) IDEAL 2005. LNCS, vol. 3578, pp. 109–116. Springer, Heidelberg (2005)

    CrossRef  Google Scholar 

  2. Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication. IEEE Transactions on Knowledge and Data Engineering 24 (2012)

    Google Scholar 

  3. Christen, P., Pudjijono, A.: Accurate synthetic generation of realistic personal information. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.-B. (eds.) PAKDD 2009. LNCS, vol. 5476, pp. 507–514. Springer, Heidelberg (2009)

    CrossRef  Google Scholar 

  4. Christen, P., Pudjijono, A.: Accurate Synthetic Generation of Realistic Personal Information. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.-B. (eds.) PAKDD 2009. LNCS, vol. 5476, pp. 507–514. Springer, Heidelberg (2009), http://dx.doi.org/10.1007/978-3-642-01307-2_47

    CrossRef  Google Scholar 

  5. Christen, P., Vatsalan, D.: Flexible and Extensible Generation and Corruption of Personal Data. In: Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM 2013) (2013)

    Google Scholar 

  6. Clauset, A., Newman, M.E., Moore, C.: Finding community structure in very large networks. Physical Review E 70, 066111 (2004)

    Google Scholar 

  7. Durham, E.A.: A framework for accurate, efficient private record linkage. Ph.D. thesis, Vanderbilt University (2012)

    Google Scholar 

  8. Fortunato, S.: Community detection in graphs. Physics Reports 486, 75–174 (2010)

    CrossRef  MathSciNet  Google Scholar 

  9. Goldenberg, A., Zheng, A.X., Fienberg, S.E., Airoldi, E.M.: A survey of statistical network models. Foundations and Trends® in Machine Learning 2, 129–233 (2010)

    CrossRef  Google Scholar 

  10. Hall, R., Fienberg, S.: Valid statistical inference on automatically matched files. In: Domingo-Ferrer, J., Tinnirello, I. (eds.) PSD 2012. LNCS, vol. 7556, pp. 131–142. Springer, Heidelberg (2012)

    CrossRef  Google Scholar 

  11. Herzog, T., Scheuren, F., Winkler, W.: Data Quality and Record Linkage Techniques. Springer, New York (2007)

    MATH  Google Scholar 

  12. Herzog, T., Scheuren, F., Winkler, W.: Record linkage. Wiley Interdisciplinary Reviews: Computational Statistics 2 (2010), doi:10.1002/wics.108

    Google Scholar 

  13. Karakasidis, A., Verykios, V.S.: Reference table based k-anonymous private blocking. In: Proceedings of the 27th Annual ACM Symposium on Applied Computing, pp. 859–864. ACM (2012)

    Google Scholar 

  14. Kuzu, M., Kantarcioglu, M., Durham, E., Malin, B.: A constraint satisfaction cryptanalysis of bloom filters in private record linkage. In: Fischer-Hübner, S., Hopper, N. (eds.) PETS 2011. LNCS, vol. 6794, pp. 226–245. Springer, Heidelberg (2011)

    CrossRef  Google Scholar 

  15. Liang, H., Wang, Y., Christen, P., Gayler, R.: Noise-tolerant approximate blocking for dynamic real-time entity resolution. In: Tseng, V.S., Ho, T.B., Zhou, Z.-H., Chen, A.L.P., Kao, H.-Y. (eds.) PAKDD 2014, Part II. LNCS, vol. 8444, pp. 449–460. Springer, Heidelberg (2014)

    CrossRef  Google Scholar 

  16. McCallum, A., Nigam, K., Ungar, L.H.: Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 169–178. ACM (2000)

    Google Scholar 

  17. Pasula, H., Marthi, B., Milch, B., Russell, S., Shpitser, I.: Identity uncertainty and citation matching. In: Advances in Neural Information Processing Systems, pp. 1425–1432 (2003)

    Google Scholar 

  18. Paulevé, L., Jégou, H., Amsaleg, L.: Locality sensitive hashing: A comparison of hash function types and querying mechanisms. Pattern Recognition Letters 31, 1348–1358 (2010)

    CrossRef  Google Scholar 

  19. Rajaraman, A., Ullman, J.D.: Mining of massive datasets. Cambridge University Press (2012)

    Google Scholar 

  20. Steorts, R.C., Hall, R., Fienberg, S.: A Bayesian approach to graphical record linkage and de-duplication (2013) (submitted)

    Google Scholar 

  21. Vatsalan, D., Christen, P.: Sorted nearest neighborhood clustering for efficient private blocking. In: Pei, J., Tseng, V.S., Cao, L., Motoda, H., Xu, G. (eds.) PAKDD 2013, Part II. LNCS, vol. 7819, pp. 341–352. Springer, Heidelberg (2013)

    CrossRef  Google Scholar 

  22. Vatsalan, D., Christen, P., Verykios, V.S.: An efficient two-party protocol for approximate matching in private record linkage. In: Proceedings of the Ninth Australasian Data Mining Conference, vol. 121, pp. 125–136. Australian Computer Society, Inc. (2011)

    Google Scholar 

  23. Winkler, W., Yancey, W., Porter, E.: Fast record linkage of very large files in support of decennial and administrative records projects. In: Proceedings of American Statistical Association Section on Survey Research Methods (2010)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and Permissions

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Steorts, R.C., Ventura, S.L., Sadinle, M., Fienberg, S.E. (2014). A Comparison of Blocking Methods for Record Linkage. In: Domingo-Ferrer, J. (eds) Privacy in Statistical Databases. PSD 2014. Lecture Notes in Computer Science, vol 8744. Springer, Cham. https://doi.org/10.1007/978-3-319-11257-2_20

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-11257-2_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11256-5

  • Online ISBN: 978-3-319-11257-2

  • eBook Packages: Computer ScienceComputer Science (R0)