Skip to main content
Log in

A Survey on Blocking Technology of Entity Resolution

  • Survey
  • Published:
Journal of Computer Science and Technology Aims and scope Submit manuscript

Abstract

Entity resolution (ER) is a significant task in data integration, which aims to detect all entity profiles that correspond to the same real-world entity. Due to its inherently quadratic complexity, blocking was proposed to ameliorate ER, and it offers an approximate solution which clusters similar entity profiles into blocks so that it suffices to perform pairwise comparisons inside each block in order to reduce the computational cost of ER. This paper presents a comprehensive survey on existing blocking technologies. We summarize and analyze all classic blocking methods with emphasis on different blocking construction and optimization techniques. We find that traditional blocking ER methods which depend on the fixed schema may not work in the context of highly heterogeneous information spaces. How to use schema information flexibly is of great significance to efficiently process data with the new features of this era. Machine learning is an important tool for ER, but end-to-end and efficient machine learning methods still need to be explored. We also sum up and provide the most promising trend for future work from the directions of real-time blocking ER, incremental blocking ER, deep learning with ER, etc.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Similar content being viewed by others

References

  1. Bhattacharya I, Getoor L. A Latent Dirichlet model for unsupervised entity resolution. In Proc. the 2006 SIAM International Conference on Data Mining, Apr. 2006, pp.47-58.

  2. Liu X L, Wang H Z, Li J Z, Gao H. EntityManager: Managing dirty data based on entity resolution. J. Comput. Sci. Technol., 2017, 32(3): 644-662.

    MathSciNet  Google Scholar 

  3. Winkler W E. Methods for evaluating and creating data quality. Inf. Syst., 2004, 29(7): 531-550.

    Google Scholar 

  4. Winkler WE. Overview of record linkage and current research directions. Technical Report, U.S. Bureau of the Census, 2006. http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=7D191FC85CD0D418884ACD2CECC2C190?doi=1-0.1.1.79.1519&rep=rep1&type=pdf, March 2020.

  5. Newcombe H B, Kennedy J M, Axford S J, James A P. Automatic linkage of vital records. Science, 1959, 130(3381): 954-959.

    Google Scholar 

  6. Bhattacharya I, Getoor L. Iterative record linkage for cleaning and integration. In Proc. the 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, Jun. 2004, pp.11-18.

  7. Pasula H, Marthi B, Milch B, Russell S J, Shpitser I. Identity uncertainty and citation matching. In Proc. the 2002 Annual Conference on Neural Information Processing Systems, Dec. 2002, pp.1401-1408.

  8. Fan W F, Jia X B, Li J Z, Ma S. Reasoning about record matching rules. Proc. the VLDB Endowment, 2009, 2(1): 407-418.

    Google Scholar 

  9. Bilenko M, Mooney R J. Adaptive duplicate detection using learnable string similarity measures. In Proc. the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2003, pp.39-48.

  10. Bhattacharya I, Getoor L. Deduplication and group detection using links. In Proc. the 2004 ACM SIGKDD Workshop on Link Analysis and Group Detection, August 2004.

  11. Getoor L, Machanavajjhala A. Entity resolution: Theory, practice & open challenges. Proc. the VLDB Endowment, 2012, 5(12): 2018-2019.

    Google Scholar 

  12. Christen P. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Transactions on Knowledge and Data Engineering, 2011, 24(9): 1537-1555.

    Google Scholar 

  13. Christen P. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, 2012.

  14. Dong X L, Srivastava D. Big data integration. In Proc. the 29th International Conference on Data Engineering, Apr. 2013, pp.1245-1248.

  15. Papadakis G, Skoutas D, Thanos E, Palpanas T. A survey of blocking and filtering techniques for entity resolution. arXiv:1905.06167, 2019. https://arxiv.org/abs/1905.06167, March 2020.

  16. Dunn H L. Record linkage. American Journal of Public Health and the Nation’s Health, 1946, 36(12): 1412-1416.

    Google Scholar 

  17. Christophides V, Efthymiou V, Stefanidis K. Entity resolution in the web of data. Synthesis Lectures on the Semantic Web, 2015, 5(3): 1-122.

    Google Scholar 

  18. Papadakis G, Ioannou E, Palpanas T, Niederée C, Nejdl W. A blocking framework for entity resolution in highly heterogeneous information spaces. IEEE Transactions on Knowledge and Data Engineering, 2012, 25(12): 2665-2682.

    Google Scholar 

  19. Papadakis G, Ioannou E, Niederée C, Palpanas T, Nejdl W. Eliminating the redundancy in blocking-based entity resolution methods. In Proc. the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, Jun. 2011, pp.85-94.

  20. Papadakis G, Alexiou G, Papastefanatos G, Koutrika G. Schema-agnostic vs schema-based configurations for blocking methods on homogeneous data. Proc. the VLDB Endowment, 2015, 9(4): 312-323.

    Google Scholar 

  21. Fellegi I P, Sunter A B. A theory for record linkage. Journal of the American Statistical Association, 1969, 64(328): 1183-1210.

    MATH  Google Scholar 

  22. Aizawa A, Oyama K. A fast linkage detection scheme for multi-source information integration. In Proc. the 2005 International Workshop on Challenges in Web Information Retrieval and Integration, Apr. 2005, pp.30-39.

  23. de Vries T, Ke H, Chawla S, Christen P. Robust record linkage blocking using suffix arrays. In Proc. the 18th ACM Conference on Information and Knowledge Management, Nov. 2009, pp.305-314.

  24. Allam A, Skiadopoulos S, Kalnis P. Improved suffix blocking for record linkage and entity resolution. Data & Knowledge Engineering, 2018, 117: 98-113.

    Google Scholar 

  25. Gravano L, Ipeirotis P G, Jagadish H V, Koudas N, Muthukrishnan S, Srivastava D. Approximate string joins in a database (almost) for free. In Proc. the 27th International Conference on Very Large Data Bases, Sept. 2001, pp.491-500.

  26. Baxter R, Christen P, Churches T. A comparison of fast blocking methods for record linkage. In Proc. the ACM SIGKDD 2003 Workshop on Data Cleaning, Record Linkage and Object Consolidation, Aug. 2003.

  27. Kenig B, Gal A. MFIBlocks: An effective blocking algorithm for entity resolution. Information Systems, 2013, 38(6): 908-926.

    Google Scholar 

  28. McCallum A, Nigam K, Ungar L H. Efficient clustering of high-dimensional data sets with application to reference matching. In Proc. the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2000, pp.169-178.

  29. Hernández M A, Stolfo S J. The merge/purge problem for large databases. ACM SIGMOD Record, 1995, 24(2): 127-138.

    Google Scholar 

  30. Yan S, Lee D, Kan M Y, Giles L C. Adaptive sorted neighborhood methods for efficient record linkage. In Proc. the 7th ACM/IEEE Joint Conference on Digital Libraries, Jun. 2007, pp.185-194.

  31. Draisbach U, Naumann F. A comparison and generalization of blocking and windowing algorithms for duplicate detection. In Proc. the International Workshop on Quality in Databases, Aug. 2009, pp.51-56.

  32. Draisbach U, Naumann F, Szott S, Wonneberg O. Adaptive windows for duplicate detection. In Proc. the 28th IEEE International Conference on Data Engineering, Apr. 2012, pp.1073-1083.

  33. Draisbach U, Naumann F. A generalization of blocking and windowing algorithms for duplicate detection. In Proc. the 2011 International Conference on Data and Knowledge Engineering, Sept. 2011, pp.18-24.

  34. Papadakis G, Ioannou E, Niederée C, Fankhauser P. Efficient entity resolution for large heterogeneous information spaces. In Proc. the 4th International Conference on Web Search and Web Data Mining, Feb. 2011, pp.535-544.

  35. Papadakis G, Ioannou E, Niederée C, Palpanas T, Nejdl W. Beyond 100 million entities: Large-scale blocking-based resolution for heterogeneous data. In Proc. the 5th International Conference on Web Search and Web Data Mining, Feb. 2012, pp.53-62.

  36. Song D, Heflin J. Automatically generating data linkages using a domain-independent candidate selection approach. In Proc. the 10th International Semantic Web Conference, Oct. 2011, pp.649-664.

  37. Nin J, Muntes-Mulero V, Martínez-Bazan N, Larriba-Pey J. On the use of semantic blocking techniques for data cleansing and integration. In Proc. the 11th International Database Engineering and Applications Symp., Sept. 2007, pp.190-198.

  38. Bilenko M, Kamath B, Mooney R J. Adaptive blocking: Learning to scale up record linkage. In Proc. the 6th IEEE International Conference on Data Mining, Dec. 2006, pp.87-96.

  39. Michelson M, Knoblock C A. Learning blocking schemes for record linkage. In Proc. the 21st National Conference on Artificial Intelligence and the 18th Innovative Applications of Artificial Intelligence Conference, Jul. 2006, pp.440-445.

  40. Evangelista L O, Cortez E, da Silva A S, Meira W. Adaptive and flexible blocking for record linkage tasks. Journal of Information and Data Management, 2010, 1(2): 167-167.

    Google Scholar 

  41. Sarma A D, Jain A, Machanavajjhala A, Bohannon P. An automatic blocking mechanism for large-scale deduplication tasks. In Proc. the 21st ACM International Conference on Information and Knowledge Management, Oct. 2012, pp.1055-1064.

  42. Giang P H. A machine learning approach to create blocking criteria for record linkage. Health Care Management Science, 2015, 18(1): 93-105.

    MathSciNet  Google Scholar 

  43. Cao Y, Chen Z, Zhu J, Yue P, Lin C, Yu Y. Leveraging unlabeled data to scale blocking for record linkage. In Proc. the 22nd International Joint Conference on Artificial Intelligence, Jul. 2011, pp.2211-2217.

  44. Mayank K, Daniel P M. An unsupervised algorithm for learning blocking schemes. In Proc. the 13th International Conference on Data Mining, Dec. 2019, pp.340-349.

  45. O’Hare K, Jurek-Loughrey A, de Campos C. An unsupervised blocking technique for more efficient record linkage. Data & Knowledge Engineering, 2019, 122: 181-195.

    Google Scholar 

  46. Ma Y T, Tran T. TYPiMatch: Type-specific unsupervised learning of keys and key values for heterogeneous web data integration. In Proc. the 6th ACM International Conference on Web Search and Data Mining, Feb. 2013, pp.325-334.

  47. Kejriwal M, Miranker D P. A two-step blocking scheme learner for scalable link discovery. In Proc. the 9th International Workshop on Ontology Matching Collocated with the 13th International Semantic Web Conference, Oct. 2014, pp.49-60.

  48. Kejriwal M, Miranker D P. A DNF blocking scheme learner for heterogeneous datasets. arXiv:1501.01694. https://arxiv.org/abs/1501.01694, Jan. 2020.

  49. Papadakis G, Ioannou E, Niederée C, Palpanas T, Nejdl W. To compare or not to compare: Making entity resolution more efficient. In Proc. the International Workshop on Semantic Web Information Management, Jun. 2011, Article No. 3.

  50. Papadakis G, Koutrika G, Palpanas T, Nejdl W. Metablocking: Taking entity resolution to the next level. IEEE Transactions on Knowledge and Data Engineering, 2013, 26(8): 1946-1960.

    Google Scholar 

  51. Ferragina P, Grossi R. The string B-tree: A new data structure for string search in external memory and its applications. Journal of the ACM, 1999, 46(2): 236-280.

    MathSciNet  MATH  Google Scholar 

  52. Christen P. Towards parameter-free blocking for scalable record linkage. Technical Report, Faculty of Engineering and Information Technology, 2007. http://users.cecs.anu.edu.au/˜Peter.Christen/publications/tr-cs-07-03.pdf, March 2020.

  53. Madhavan J, Jeffery S R, Cohen S, Dong X L, Ko D, Yu C, Halevy A. Web-scale data integration: You can only afford to pay as you go. In Proc. the 3rd Biennial Conference on Innovative Data Systems Research, Jan. 2007, pp.342-350.

  54. Papadakis G, Palpanas T. Blocking for large-scale entity resolution: Challenges, algorithms, and practical examples. In Proc. the 32nd International Conference on Data Engineering, May 2016, pp.1436-1439.

  55. Chaudhuri S, Chen B, Ganti V, Kaushik R. Example-driven design of efficient record matching queries. In Proc. the 33rd International Conference on Very Large Data Bases, Sept. 2007, pp.327-338.

  56. Papadakis G, Demartini G, Fankhauser P, Kärger P. The missing links: Discovering hidden same-as links among a billion of triples. In Proc. the 12th International Conference on Information Integration and Web-Based Applications and Services, Nov. 2010, pp.453-460.

  57. Winkler W E. Approximate string comparator search strategies for very large administrative lists. Technical Report, U.S. Census Bureau, 2005. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.79.402&rep=rep1&type=pdf, March 2020.

  58. Valiant L G. A theory of the learnable. Communications of the ACM, 1984, 27(11): 1134-1142.

    MATH  Google Scholar 

  59. Suchanek F M, Abiteboul S, Senellart P. PARIS: Probabilistic alignment of relations, instances, and schema. Proceedings the VLDB Endowment, 2011, 5(3): 157-168.

    Google Scholar 

  60. Papadakis G, Svirsky J, Gal A, Palpanas T. Comparative analysis of approximate blocking techniques for entity resolution. Proceedings the VLDB Endowment, 2016, 9(9): 684-695.

    Google Scholar 

  61. Jonker R, Volgenant T. Improving the Hungarian assignment algorithm. Operations Research Letters, 1986, 5(4): 171-175.

    MathSciNet  MATH  Google Scholar 

  62. Verikas A, Gelzinis A, Bacauskiene M. Mining data with random forests: A survey and results of new tests. Pattern Recognition, 2011, 44(2): 330-349.

    Google Scholar 

  63. Bilke A, Naumann F. Schema matching using duplicates. In Proc. the 21st International Conference on Data Engineering, Apr. 2005, pp.69-80.

  64. Papadakis G, Papastefanatos G, Palpanas T, Koubarakis M. Scaling entity resolution to large, heterogeneous data with enhanced meta-blocking. In Proc. the 19th International Conference on Extending Database Technology, Mar. 2016, pp.221-232.

  65. Fisher J, Christen P, Wang Q, Rahm E. A clustering-based framework to control block sizes for entity resolution. In Proc. the 21st International Conference on Knowledge Discovery and Data Mining, Aug. 2015, pp.279-288.

  66. Whang S E, Menestrina D, Koutrika G, Theobald M, Garcia-Molina H. Entity resolution with iterative blocking. In Proc. the International Conference on Management of Data, Jun. 2009, pp.219-232.

  67. Shu L, Chen A, Xiong M, Meng W. Efficient SPectrAl Neighborhood blocking for entity resolution. In Proc. the 27th International Conference on Data Engineering, Apr. 2011, pp.1067-1078.

  68. Shi J, Malik J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000, 22(8): 888-905.

    Google Scholar 

  69. Christen P, Gayler R, Hawking D. Similarity-aware indexing for real-time entity resolution. In Proc. the 18th ACM Conference on Information and Knowledge Management, Nov. 2009, pp.1565-1568.

  70. Ramadan B, Christen P, Liang H, Gayler R W, Hawking D. Dynamic similarity-aware inverted indexing for real-time entity resolution. In Proc. the 2013 PAKDD Workshop on Data Mining Applications in Industry and Government, Apr. 2013, pp.47-58.

  71. Ramadan B, Christen P. Forest-based dynamic sorted neighborhood indexing for real-time entity resolution. In Proc. the 23rd International Conference on Information and Knowledge Management, Nov. 2014, pp.1787-1790.

  72. Ramadan B, Christen P, Liang H, Gayler R W. Dynamic sorted neighborhood indexing for real-time entity resolution. Journal of Data and Information Quality, 2015, 6(4): Article No. 15.

  73. Rice S V. Braided AVL trees for efficient event sets and ranked sets in the SIMSCRIPT III simulation programming language. In Proc. the 2007 Western Multiconference on Computer Simulation, Jan. 2007, pp.150-155.

  74. Ramadan B, Christen P. Unsupervised blocking key selection for real-time entity resolution. In Proc. the 19th Pacific-Asia Conference on Knowledge Discovery and Data Mining, May 2015, pp.574-585.

  75. Liang H, Wang Y, Christen P, Gayler R. Noise-tolerant approximate blocking for dynamic real-time entity resolution. In Proc. the 18th Pacific-Asia Conference on Knowledge Discovery and Data Mining, May 2014, pp.449-460.

  76. Benjelloun O, Garcia-Molina H, Menestrina D, Su Q, Whang S E, Widom J. Swoosh: A generic approach to entity resolution. The VLDB Journal, 2009, 18(1): 255-276.

    Google Scholar 

  77. Araújo B T, Stefanidis K, Pires C E S, Nummenmaa J, da Nóbrega P T. Incremental blocking for entity resolution over web streaming data. In Proc. the 2019 IEEE/WIC/ACM International Conference on Web Intelligence, Oct. 2019, pp.332-336.

  78. Ebraheem M, Thirumuruganathan S, Joty S, Ouzzani M, Tang N. Distributed representations of tuples for entity resolution. Proc. the VLDB Endowment, 2018, 11(11): 1454-1467.

    Google Scholar 

  79. Mudgal S, Li H, Rekatsinas T, Doan A, Park Y, Krishnan G, Deep R, Arcaute E, Raghavendra V. Deep learning for entity matching: A design space exploration. In Proc. the 2018 International Conference on Management of Data, Jun. 2018, pp.19-34.

  80. di Cicco V, Firmani D, Koudas N, Merialdo P, Srivastava D. Interpreting deep learning models for entity resolution: An experience report using LIME. In Proc. the 2nd International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, Jul. 2019, Article No. 8.

  81. Papenbrock T, Heise A, Naumann F. Progressive duplicate detection. IEEE Transactions on Knowledge and Data Engineering, 2014, 27(5): 1316-1329.

    Google Scholar 

  82. Whang S E, Marmaros D, Garcia-Molina H. Pay-as-you-go entity resolution. IEEE Transactions on Knowledge and Data Engineering, 2012, 25(5): 1111-1124.

    Google Scholar 

  83. Altowim Y, Kalashnikov D V, Mehrotra S. Progressive approach to relational entity resolution. Proc. the VLDB Endowment, 2014, 7(11): 999-1010.

    Google Scholar 

  84. Altowim Y, Mehrotra S. Parallel progressive approach to entity resolution using MapReduce. In Proc. the 33rd IEEE International Conference on Data Engineering, Apr. 2017, pp.909-920.

  85. Bhattacharya I, Getoor L. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data, 2007, 1(1): Article No. 5.

  86. Globerson A, Lazic N, Chakrabarti S, Subramanya A, Ringaard M, Pereira F. Collective entity resolution with multi-focal attention. In Proc. the 54th Annual Meeting of the Association for Computational Linguistics, Aug. 2016.

  87. Kouki P, Pujara J, Marcum C, Koehly L, Getoor L. Collective entity resolution in familial networks. In Proc. the 2017 IEEE International Conference on Data Mining, Nov. 2017, pp.227-236.

  88. de Assis Costa G, de Oliveira J M P. A relational learning approach for collective entity resolution in the web of data. In Proc. the 5th International Workshop on Consuming Linked Data Co-Located with the 13th International Semantic Web Conference, Oct. 2014.

  89. Kouki P, Pujara J, Marcum C, Koehly L, Getoor L. Collective entity resolution in multi-relational familial networks. Knowledge and Information Systems, 2019, 61(3): 1547-1581.

    Google Scholar 

  90. Wang J, Kraska T, Franklin M J, Feng J. CrowdER: Crowdsourcing entity resolution. arXiv:1208.1927, 2012. http://arxiv.org/abs/1208.1927, Aug. 2018.

  91. Vesdapunt N, Bellare K, Dalvi N. Crowdsourcing algorithms for entity resolution. Proc. the VLDB Endowment, 2014, 7(12): 1071-1082.

    Google Scholar 

  92. Gong S S, Hu W, Ge W Y, Qu Y Z. Modeling topic-based human expertise for crowd entity resolution. Journal of Computer Science and Technology, 2018, 33(6): 1204-1218.

    MathSciNet  Google Scholar 

  93. Zhang A Z, Li J Z, Gao H, Chen Y B, Ma H Z, Bah M J. CrowdOLA: Online aggregation on duplicate data powered by crowdsourcing. Journal of Computer Science and Technology, 2018, 33(2): 366-379.

    Google Scholar 

  94. Mazumdar A, Saha B. A theoretical analysis of first heuristics of crowdsourced entity resolution. In Proc. the 31st AAAI Conference on Artificial Intelligence, Feb. 2017, pp.970-976.

  95. Chai C, Li G, Li J, Deng D, Feng J. A partial-order-based framework for cost-effective crowdsourced entity resolution. The VLDB Journal, 2018, 27(6): 745-770.

    Google Scholar 

  96. Maskat R, Paton N W, Embury S M. Pay-as-you-go configuration of entity resolution. Transactions on Large-Scale Data-and Knowledge-Centered Systems, 2016, 29: 40-65.

    Google Scholar 

  97. Li H, Konda P, G.C. P S, Doan A, Snyder B, Park, Y, Krishnan G, Deep R, Raghavendra V. MatchCatcher: A debugger for blocking in entity matching. In Proc. the 21st International Conference on Extending Database Technology, Mar. 2018, pp.193-204.

  98. Papadakis G, Giannakopoulos G, Niederée C, Palpanas T, Nejdl W. Detecting and exploiting stability in evolving heterogeneous information spaces. In Proc. the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, Jun. 2011, pp.95-104.

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Bo-Han Li.

Electronic supplementary material

ESM 1

(PDF 227 kb)

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Li, BH., Liu, Y., Zhang, AM. et al. A Survey on Blocking Technology of Entity Resolution. J. Comput. Sci. Technol. 35, 769–793 (2020). https://doi.org/10.1007/s11390-020-0350-4

Download citation

  • Received:

  • Revised:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11390-020-0350-4

Keywords

Navigation