Skip to main content

A Review of Unsupervised and Semi-supervised Blocking Methods for Record Linkage

  • Chapter
  • First Online:
Linking and Mining Heterogeneous and Multi-view Data

Abstract

Record linkage, referred to also as entity resolution, is a process of identifying records representing the same real-world entity (e.g. a person) across varied data sources. To reduce the computational complexity associated with record comparisons, a task referred to as blocking is commonly performed prior to the linkage process. The blocking task involves partitioning records into blocks of records and treating records from different blocks as not related to the same entity. Following this, record linkage methods are applied within each block significantly reducing the number of record comparisons. Most of the existing blocking techniques require some degree of parameter selection in order to optimise the performance for a particular dataset (e.g. attributes and blocking functions used for splitting records into blocks). Optimal parameters can be selected manually but this is expensive in terms of time and cost and assumes a domain expert to be available. Automatic supervised blocking techniques have been proposed; however, they require a set of labelled data in which the matching status of each record is known. In the majority of real-world scenarios, we do not have any information regarding the matching status of records obtained from multiple sources. Therefore, there is a demand for blocking techniques that sufficiently reduce the number of record comparisons with little to no human input or labelled data required. Given the importance of the problem, recent research efforts have seen the development of novel unsupervised and semi-supervised blocking techniques. In this chapter, we review existing blocking techniques and discuss their advantages and disadvantages. We detail other research areas that have recently arose and discuss other unresolved issues that are still to be addressed.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 109.00
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Hardcover Book
USD 139.99
Price excludes VAT (USA)
  • Durable hardcover edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Aizawa, A., Oyama, K.: A fast linkage detection scheme for multi-source information integration. In: Proceedings of International Workshop on Challenges in Web Information Retrieval and Integration, WIRI’05, pp. 30–39. IEEE, Piscataway (2005)

    Google Scholar 

  2. Atencia, M., David, J., Scharffe, F.: Keys and pseudo-keys detection for web datasets cleansing and interlinking. In: International Conference on Knowledge Engineering and Knowledge Management, pp. 144–153. Springer, Berlin (2012)

    Google Scholar 

  3. Babu, B.V., Santoshi, K.J.: Unsupervised detection of duplicates in user query results using blocking. In: International Journal of Computer Science and Information Technologies, 5(3), 3514–3520. IJCSIT (2014)

    Google Scholar 

  4. Baxter, R., Christen, P., Churches, T., et al.: A comparison of fast blocking methods for record linkage. In: ACM SIGKDD, vol. 3, pp. 25–27. Citeseer (2003)

    Google Scholar 

  5. Bertolazzi, P., De Santis, L., Scannapieco, M.: Automatic record matching in cooperative information systems. In: Proceedings of the International Workshop on Data Quality in Cooperative Information Systems (DQCIS), p. 9 (2003)

    Google Scholar 

  6. Bilenko, M., Kamath, B., Mooney, R.J.: Adaptive blocking: learning to scale up record linkage. In: Sixth International Conference on Data Mining, ICDM’06, pp. 87–96. IEEE, Piscataway (2006)

    Google Scholar 

  7. Christen, P.: Improving data linkage and deduplication quality through nearest-neighbour based blocking. In: Department of Computer Science, The Australian National University. ACM (2007)

    Google Scholar 

  8. Christen, P.: Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, Berlin (2012)

    Book  Google Scholar 

  9. Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng. 24(9), 1537–1555 (2012)

    Article  Google Scholar 

  10. Christen, P., Churches, T., et al.: Febrl-freely extensible biomedical record linkage. In: Department of Computer Science, Australian National University (2002)

    Google Scholar 

  11. Cui, M.: Towards a scalable and robust entity resolution-approximate blocking with semantic constraints. In: COMP8740: Artificial Intelligence Project, Australian National University (2014)

    Google Scholar 

  12. dal Bianco, G., Gonçalves, M.A., Duarte, D.: Bloss: effective meta-blocking with almost no effort. Inf. Syst. 75, 75–89 (2018)

    Google Scholar 

  13. Bilenko, M.: Learnable similarity functions and their application to clustering and record linkage. In: Proceedings of the Nineteenth National Conference on Artificial Intelligence, pp. 981–982 (2004)

    Google Scholar 

  14. Draisbach, U., Naumann, F., Szott, S., Wonneberg, O.: Adaptive windows for duplicate detection. In: 2012 IEEE 28th International Conference on Data Engineering (ICDE), pp. 1073–1083. IEEE, Piscataway (2012)

    Google Scholar 

  15. Duda, R.O., Hart, P.E., Stork, D.G., et al.: Pattern Classification, 2nd edn, p. 55. Wiley, New York (2001)

    Google Scholar 

  16. Elfeky, M.G., Verykios, V.S., Elmagarmid, A.K.: Tailor: a record linkage toolbox. In: Proceedings of 18th International Conference on Data Engineering, pp. 17–28. IEEE, Piscataway (2002)

    Google Scholar 

  17. Faloutsos, C., Lin, K.I.: FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets, vol. 24. ACM, New York (1995)

    Google Scholar 

  18. Fisher, J., Christen, P., Wang, Q., Rahm, E.: A clustering-based framework to control block sizes for entity resolution. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 279–288. ACM, New York (2015)

    Google Scholar 

  19. Gionis, A., Indyk, P., Motwani, R., et al.: Similarity search in high dimensions via hashing. In: VLDB’99 Proceedings of the 25th International Conference on Very Large Data Bases, vol. 99, pp. 518–529 (1999)

    Google Scholar 

  20. Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D., et al.: Approximate string joins in a database (almost) for free. In: Proceeding VLDB ’01 Proceedings of the 27th International Conference on Very Large Data Bases, vol. 1, pp. 491–500 (2001)

    Google Scholar 

  21. Guillet, F., Hamilton, H.J.: Quality Measures in Data Mining, vol. 43. Springer, Berlin (2007)

    Book  Google Scholar 

  22. Hernández, M.A., Stolfo, S.J.: Real-world data is dirty: data cleansing and the merge/purge problem. Data Min. Knowl. Disc. 2(1), 9–37 (1998)

    Article  Google Scholar 

  23. Herschel, M., Naumann, F., Szott, S., Taubert, M.: Scalable iterative graph duplicate detection. IEEE Trans. Knowl. Data Eng. 24(11), 2094–2108 (2012)

    Article  Google Scholar 

  24. Hristescu, G., Farach-Colton, M.: Cluster-preserving embedding of proteins. Technical Report 99-50, Computer Science Department, Rutgers University (1999)

    Google Scholar 

  25. Ioannou, E., Papapetrou, O., Skoutas, D., Nejdl, W.: Efficient semantic-aware detection of near duplicate resources. In: Extended Semantic Web Conference, pp. 136–150. Springer, Berlin (2010)

    Google Scholar 

  26. Isele, R.: Learning expressive linkage rules for entity matching using genetic programming. Ph.D. thesis (2013)

    Google Scholar 

  27. Jin, L., Li, C., Mehrotra, S.: Efficient record linkage in large data sets. In: Proceedings of the Eighth International Conference on Database Systems for Advanced Applications, DASFAA ’03, p. 137. IEEE Computer Society, Washington (2003). http://dl.acm.org/citation.cfm?id=789081.789250

  28. Jonker, R., Volgenant, T.: Improving the Hungarian assignment algorithm. Oper. Res. Lett. 5(4), 171–175 (1986)

    Article  MathSciNet  Google Scholar 

  29. Karakasidis, A., Verykios, V.S.: A sorted neighborhood approach to multidimensional privacy preserving blocking. In: IEEE 12th International Conference on Data Mining Workshops (ICDMW), pp. 937–944. IEEE, Piscataway (2012)

    Google Scholar 

  30. Karakasidis, A., Koloniari, G., Verykios, V.S.: Scalable blocking for privacy preserving record linkage. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 527–536. ACM, New York (2015)

    Google Scholar 

  31. Karapiperis, D., Verykios, V.S.: An LSH-based blocking approach with a homomorphic matching technique for privacy-preserving record linkage. IEEE Trans. Knowl. Data Eng. 27(4), 909–921 (2015)

    Article  Google Scholar 

  32. Kejriwal, M., Miranker, D.P.: A two-step blocking scheme learner for scalable link discovery. In: CEUR Workshop Proceedings. Vol 1317, pp. 49–60 (2014)

    Google Scholar 

  33. Kejriwal, M., Miranker, D.P.: N-way heterogeneous blocking. Tech. rep., TR-14-06 (2013)

    Google Scholar 

  34. Kejriwal, M., Miranker, D.P.: An unsupervised algorithm for learning blocking schemes. In: IEEE 13th International Conference on Data Mining (ICDM), pp. 340–349. IEEE, Piscataway (2013)

    Google Scholar 

  35. Kejriwal, M., Miranker, D.P.: On linking heterogeneous dataset collections. In: International Semantic Web Conference (Posters & Demos), pp. 217–220. Citeseer (2014)

    Google Scholar 

  36. Kejriwal, M., Miranker, D.P.: Sorted neighborhood for schema-free RDF data. In: International Semantic Web Conference. pp. 217–229. Springer, Berlin (2015)

    Chapter  Google Scholar 

  37. Kejriwal, M., Miranker, D.P.: An unsupervised instance matcher for schema-free RDF data. Web Semant. Sci. Serv. Agents World Wide Web 35, 102–123 (2015)

    Article  Google Scholar 

  38. Kim, H.S., Lee, D.: Harra: fast iterative hashed record linkage for large-scale data collections. In: Proceedings of the 13th International Conference on Extending Database Technology, pp. 525–536. ACM, New York (2010)

    Google Scholar 

  39. Kolb, L., Thor, A., Rahm, E.: Parallel sorted neighborhood blocking with MapReduce (2010). arXiv:1010.3053

    Google Scholar 

  40. Kolb, L., Thor, A., Rahm, E.: Multi-pass sorted neighborhood blocking with MapReduce. Comput. Sci. Res. Dev. 27(1), 45–63 (2012)

    Article  Google Scholar 

  41. Lehti, P., Fankhauser, P.: Unsupervised duplicate detection using sample non-duplicates. Lect. Notes Comput. Sci. 4244, 136 (2006)

    Article  Google Scholar 

  42. Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2014)

    Book  Google Scholar 

  43. Liang, H., Wang, Y., Christen, P., Gayler, R.: Noise-tolerant approximate blocking for dynamic real-time entity resolution. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 449–460. Springer, Berlin (2014)

    Chapter  Google Scholar 

  44. Ma, K., Yang, B.: Parallel NoSQL entity resolution approach with MapReduce. In: International Conference on Intelligent Networking and Collaborative Systems (INCOS), pp. 384–389. IEEE, Piscataway (2015)

    Google Scholar 

  45. McCallum, A., Nigam, K., Ungar, L.H.: Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 169–178. ACM, New York (2000)

    Google Scholar 

  46. Mestre, D.G., Pires, C.E., Nascimento, D.C.: Adaptive sorted neighborhood blocking for entity matching with MapReduce. In: Proceedings of the 30th Annual ACM Symposium on Applied Computing, pp. 981–987. ACM, New York (2015)

    Google Scholar 

  47. Michelson, M., Knoblock, C.A.: Learning blocking schemes for record linkage. In: AAAI’06 Proceedings of the 21st National Conference on Artificial Intelligence, pp. 440–445 (2006)

    Google Scholar 

  48. Naumann, F., Herschel, M.: An introduction to duplicate detection. Synth. Lect. Data Manage. 2(1), 1–87 (2010)

    Article  Google Scholar 

  49. Papadakis, G., Palpanas, T.: Blocking for large-scale entity resolution: challenges, algorithms, and practical examples. In: IEEE 32nd International Conference on Data Engineering (ICDE), pp. 1436–1439. IEEE, Piscataway (2016)

    Google Scholar 

  50. Papadakis, G., Ioannou, E., Niederée, C., Palpanas, T., Nejdl, W.: Eliminating the redundancy in blocking-based entity resolution methods. In: Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, pp. 85–94. ACM, New York (2011)

    Google Scholar 

  51. Papadakis, G., Ioannou, E., Palpanas, T., Niederee, C., Nejdl, W.: A blocking framework for entity resolution in highly heterogeneous information spaces. IEEE Trans. Knowl. Data Eng. 25(12), 2665–2682 (2013)

    Article  Google Scholar 

  52. Papadakis, G., Koutrika, G., Palpanas, T., Nejdl, W.: Meta-blocking: taking entity resolution to the next level. IEEE Trans. Knowl. Data Eng. 26(8), 1946–1960 (2014)

    Article  Google Scholar 

  53. Papadakis, G., Papastefanatos, G., Koutrika, G.: Supervised meta-blocking. Proc. VLDB Endowment 7(14), 1929–1940 (2014)

    Article  Google Scholar 

  54. Papadakis, G., Alexiou, G., Papastefanatos, G., Koutrika, G.: Schema-agnostic vs schema-based configurations for blocking methods on homogeneous data. Proc. VLDB Endowment 9(4), 312–323 (2015)

    Article  Google Scholar 

  55. Papadakis, G., Papastefanatos, G., Palpanas, T., Koubarakis, M.: Scaling entity resolution to large, heterogeneous data with enhanced meta-blocking. In: 19th International Conference on Extending Database Technology, pp. 221–232 (2016)

    Google Scholar 

  56. Papadakis, G., Svirsky, J., Gal, A., Palpanas, T.: Comparative analysis of approximate blocking techniques for entity resolution. Proc. VLDB Endowment 9(9), 684–695 (2016)

    Article  Google Scholar 

  57. Papenbrock, T., Heise, A., Naumann, F.: Progressive duplicate detection. IEEE Trans. Knowl. Data Eng. 27(5), 1316–1329 (2015)

    Article  Google Scholar 

  58. Ramadan, B., Christen, P.: Forest-based dynamic sorted neighborhood indexing for real-time entity resolution. In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pp. 1787–1790. ACM, New York (2014)

    Google Scholar 

  59. Ramadan, B., Christen, P.: Unsupervised blocking key selection for real-time entity resolution. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 574–585. Springer, Berlin (2015)

    Chapter  Google Scholar 

  60. Ramadan, B., Christen, P., Liang, H., Gayler, R.W., Hawking, D.: Dynamic similarity-aware inverted indexing for real-time entity resolution. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 47–58. Springer, Berlin (2013)

    Google Scholar 

  61. Ramadan, B., Christen, P., Liang, H., Gayler, R.W.: Dynamic sorted neighborhood indexing for real-time entity resolution. J. Data Inf. Qual. 6(4), 15 (2015)

    Google Scholar 

  62. Rice, S.V.: Braided AVL trees for efficient event sets and ranked sets in the SIMSCRIPT III simulation programming language. In: Western MultiConference on Computer Simulation, San Diego, pp. 150–155 (2007)

    Google Scholar 

  63. Shu, L., Chen, A., Xiong, M., Meng, W.: Efficient spectral neighborhood blocking for entity resolution. In: IEEE 27th International Conference on Data Engineering (ICDE), pp. 1067–1078. IEEE, Piscataway (2011)

    Google Scholar 

  64. Simonini, G., Bergamaschi, S., Jagadish, H.: Blast: a loosely schema-aware meta-blocking approach for entity resolution. Proc. VLDB Endowment 9(12), 1173–1184 (2016)

    Article  Google Scholar 

  65. Song, D., Heflin, J.: Scaling data linkage generation with domain-independent candidate selection. In: Proceedings of the 10th International Semantic Web Conference (ISWC) (2013)

    Google Scholar 

  66. Song, D., Heflin, J.: Automatically generating data linkages using a domain-independent candidate selection approach. In: International Semantic Web Conference, pp. 649–664. Springer, Berlin (2011)

    Chapter  Google Scholar 

  67. Steorts, R.C., Ventura, S.L., Sadinle, M., Fienberg, S.E.: A comparison of blocking methods for record linkage. In: International Conference on Privacy in Statistical Databases, pp. 253–268. Springer, Berlin (2014)

    Google Scholar 

  68. Tamilselvi, J.J., Gifta, C.B.: Handling duplicate data in data warehouse for data mining. Int. J. Comput. Appl. 15(4), 1–9 (2011)

    Google Scholar 

  69. Tamilselvi, J., Saravanan, V.: Token-based method of blocking records for large data warehouse. Adv. Inf. Mining 2(2), 5–10 (2010)

    Google Scholar 

  70. Wang, J.T.L., Wang, X., Lin, K.I., Shasha, D., Shapiro, B.A., Zhang, K.: Evaluating a class of distance-mapping algorithms for data mining and clustering. In: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 307–311. ACM, New York (1999)

    Google Scholar 

  71. Wang, Q., Cui, M., Liang, H.: Semantic-aware blocking for entity resolution. IEEE Trans. Knowl. Data Eng. 28(1), 166–180 (2016)

    Article  Google Scholar 

  72. Whang, S.E., Menestrina, D., Koutrika, G., Theobald, M., Garcia-Molina, H.: Entity resolution with iterative blocking. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pp. 219–232. ACM, New York (2009)

    Google Scholar 

  73. Whang, S.E., Marmaros, D., Garcia-Molina, H.: Pay-as-you-go entity resolution. IEEE Trans. Knowl. Data Eng. 25(5), 1111–1124 (2013)

    Article  Google Scholar 

  74. Winkler, W.E.: Overview of record linkage and current research directions. Bureau of the Census. Citeseer (2006)

    Google Scholar 

  75. Yan, S., Lee, D., Kan, M.Y., Giles, L.C.: Adaptive sorted neighborhood methods for efficient record linkage. In: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 185–194. ACM, New York (2007)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Kevin O’Hare .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Check for updates. Verify currency and authenticity via CrossMark

Cite this chapter

O’Hare, K., Jurek-Loughrey, A., Campos, C.d. (2019). A Review of Unsupervised and Semi-supervised Blocking Methods for Record Linkage. In: P, D., Jurek-Loughrey, A. (eds) Linking and Mining Heterogeneous and Multi-view Data. Unsupervised and Semi-Supervised Learning. Springer, Cham. https://doi.org/10.1007/978-3-030-01872-6_4

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-01872-6_4

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-01871-9

  • Online ISBN: 978-3-030-01872-6

  • eBook Packages: EngineeringEngineering (R0)

Publish with us

Policies and ethics