A Deep-Learning-Based Blocking Technique for Entity Linkage

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2020)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12112)

Abstract

Data integration must often handle noisy data, including attribute values written in natural language, such as product descriptions or book reviews.

Entity Linkage identifies records that contain information referring to the same object. To reduce the size of the problem, modern Entity Linkage methods partition the initial search space into “Blocks” of records that can be considered similar according to some metric, so that only records within the same block need to be compared; this greatly reduces the overall complexity of the algorithm.
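To make the effect of blocking concrete, the following is a minimal sketch of traditional key-based blocking, not the method proposed in this paper. The sample records and the blocking key (the first token of each description) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def block_by_key(records, key_fn):
    """Group record ids into blocks that share the same blocking key."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[key_fn(rec)].append(rid)
    return dict(blocks)


def candidate_pairs(blocks):
    """Only records inside the same block are compared with each other."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical product descriptions (illustrative data only).
records = {
    1: "apple iphone 11 64gb black",
    2: "apple iphone 11 (64 gb), black",
    3: "samsung galaxy s10 128gb",
    4: "samsung galaxy s10, 128 gb, blue",
    5: "sony wh-1000xm3 headphones",
    6: "sony wh1000xm3 wireless headphones",
}

# Hypothetical blocking key: the first token of the description.
blocks = block_by_key(records, key_fn=lambda text: text.split()[0])
pairs = candidate_pairs(blocks)

n = len(records)
all_pairs = n * (n - 1) // 2  # comparisons needed without blocking
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
# prints "3 candidate pairs instead of 15"
```

A purely syntactic key like this one fails as soon as the first token is misspelled or missing, which is the kind of weakness on noisy, textual data that motivates a semantic blocking strategy.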

We propose a Blocking strategy that, unlike traditional methods, aims to capture the semantic properties of data by means of recent Deep Learning frameworks. This paper is mainly inspired by a recent work on Entity Linkage whose authors were among the first to investigate the application of tuple embeddings to data integration problems. We extend their method by adopting an unsupervised approach: our blocking model is trained on an external corpus and then applied to new datasets, following a “transfer learning” paradigm. This choice is motivated by the fact that, in most data integration scenarios, no training data is actually available. With this semi-automatic approach to blocking, our model, once trained on an external corpus, can be applied directly to any data integration problem.
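The abstract does not spell out how tuple embeddings are turned into blocks, so the following is only a rough sketch under stated assumptions. The `embed` function is a toy hashed bag-of-words vector standing in for a sentence encoder pretrained on an external corpus, and random-hyperplane locality-sensitive hashing is assumed as the grouping step. All names, sizes, and sample records are hypothetical.

```python
import random
import re
import zlib
from collections import defaultdict

DIM = 64     # embedding size (illustrative assumption)
N_BITS = 8   # number of random hyperplanes, i.e. bits per block signature


def embed(text, dim=DIM):
    """Toy stand-in for a pretrained sentence encoder: hashed bag of words."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # crc32 is stable across runs, unlike the built-in hash() on strings.
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


def make_hyperplanes(n_bits=N_BITS, dim=DIM, seed=0):
    """Draw fixed random hyperplanes; no labelled training data is involved."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(
        int(sum(v * h for v, h in zip(vec, plane)) >= 0.0)
        for plane in hyperplanes
    )


def lsh_blocks(records, hyperplanes):
    """Records whose embeddings share a signature fall into the same block."""
    blocks = defaultdict(list)
    for rid, text in records.items():
        blocks[signature(embed(text), hyperplanes)].append(rid)
    return dict(blocks)


# Hypothetical records (illustrative data only).
records = {
    1: "Black iPhone 11",
    2: "iphone 11, black",
    3: "Samsung Galaxy S10 128GB blue",
    4: "Sony wireless headphones",
}

planes = make_hyperplanes()
blocks = lsh_blocks(records, planes)
for sig, ids in blocks.items():
    print(sig, ids)
```

In this sketch, records 1 and 2 always share a block because their token sets are identical; a real encoder would be expected to also bring paraphrases with different wording close together, which the toy `embed` cannot do.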

We tested our system on six popular datasets and compared its performance against five traditional blocking algorithms. The results show that our deep-learning-based blocking solution outperforms standard blocking algorithms, especially on textual and noisy data.

Notes

  1. http://commoncrawl.org/the-data/

  2. https://www.nltk.org/_modules/nltk/tokenize.html

  3. https://cloud.google.com/compute/

Author information

Corresponding author

Correspondence to Fabio Azzalini.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Azzalini, F., Renzi, M., Tanca, L. (2020). A Deep-Learning-Based Blocking Technique for Entity Linkage. In: Nah, Y., Cui, B., Lee, SW., Yu, J.X., Moon, YS., Whang, S.E. (eds) Database Systems for Advanced Applications. DASFAA 2020. Lecture Notes in Computer Science, vol 12112. Springer, Cham. https://doi.org/10.1007/978-3-030-59410-7_37
