Abstract
Data integration must often handle noisy data, including attribute values written in natural language such as product descriptions or book reviews.
Entity Linkage identifies records that contain information referring to the same object. To reduce the size of the problem, modern Entity Linkage methods partition the initial search space into “Blocks” of records that are similar according to some metric, so that only records within the same block need to be compared. This greatly reduces the overall complexity of the algorithm.
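To make the effect of blocking on the search space concrete, the following minimal sketch groups records by a blocking key and counts the candidate pairs that remain. The key function (first three characters of the title), the record layout, and the sample records are hypothetical choices made only for illustration; they are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first three characters of the lowercased title.
    return record["title"].strip().lower()[:3]


def build_blocks(records, key_fn):
    # Group record ids by their blocking key.
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec["id"])
    return dict(blocks)


def candidate_pairs(blocks):
    # Only records that share a block are ever compared.
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Illustrative records (assumed data, not from the paper's datasets).
records = [
    {"id": 1, "title": "Data Matching"},
    {"id": 2, "title": "data matching: concepts"},
    {"id": 3, "title": "Deep Learning"},
    {"id": 4, "title": "Deep learning for entity matching"},
    {"id": 5, "title": "Hashing for similarity search"},
]

blocks = build_blocks(records, blocking_key)
pairs = candidate_pairs(blocks)
all_pairs = len(records) * (len(records) - 1) // 2

print(f"comparisons without blocking: {all_pairs}")
print(f"comparisons with blocking:    {len(pairs)}")
```

A purely syntactic key like this one misses records whose text differs on the surface but means the same thing, which is the limitation a semantic blocking strategy tries to address.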
We propose a Blocking strategy that, unlike traditional methods, aims to capture the semantic properties of data by means of recent Deep Learning frameworks. This paper is mainly inspired by a recent work on Entity Linkage whose authors were among the first to investigate the application of tuple embeddings to data integration problems. We extend their method with an unsupervised approach: our blocking model is trained on an external corpus and then applied to new datasets, following a “transfer learning” paradigm. This choice is motivated by the fact that, in most data integration scenarios, no training data is available. Because the model is trained once on an external corpus, it can be applied directly to any data integration problem, yielding a semi-automatic approach to blocking.
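As a rough illustration of how tuple embeddings could drive blocking, the sketch below assigns records to blocks using random-hyperplane hashing over their embedding vectors. The abstract does not state how blocks are derived from embeddings, so the hashing scheme is an assumption chosen for illustration, not the authors' method. The vectors are hand-written stand-ins for what a pre-trained sentence encoder would produce, and all function names are hypothetical.

```python
import random
from collections import defaultdict


def random_hyperplanes(num_planes, dim, seed=0):
    # Draw hyperplane normals from a seeded Gaussian so runs are repeatable.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]


def signature(vector, planes):
    # One bit per hyperplane: which side of the plane the vector falls on.
    bits = []
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vector))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)


def block_by_signature(embeddings, planes):
    # Records with identical bit signatures end up in the same block.
    blocks = defaultdict(list)
    for rec_id, vec in embeddings.items():
        blocks[signature(vec, planes)].append(rec_id)
    return dict(blocks)


# Toy 4-dimensional "embeddings" (assumed values, for illustration only).
embeddings = {
    "a1": [0.9, 0.1, 0.0, 0.2],
    "a2": [1.8, 0.2, 0.0, 0.4],    # same direction as a1, scaled by 2
    "b1": [-0.7, 0.6, 0.1, -0.3],
}

planes = random_hyperplanes(num_planes=4, dim=4)
blocks = block_by_signature(embeddings, planes)

for sig, ids in sorted(blocks.items()):
    print(sig, ids)
```

Vectors that point in the same direction always share a signature, and vectors separated by a small angle share one with high probability, so records with similar embeddings tend to fall into the same block without any labelled training pairs.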
We tested our system on six popular datasets and compared its performance against five traditional blocking algorithms. The results show that our deep-learning-based blocking solution outperforms standard blocking algorithms, especially on textual and noisy data.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Azzalini, F., Renzi, M., Tanca, L. (2020). A Deep-Learning-Based Blocking Technique for Entity Linkage. In: Nah, Y., Cui, B., Lee, SW., Yu, J.X., Moon, YS., Whang, S.E. (eds) Database Systems for Advanced Applications. DASFAA 2020. Lecture Notes in Computer Science, vol 12112. Springer, Cham. https://doi.org/10.1007/978-3-030-59410-7_37
DOI: https://doi.org/10.1007/978-3-030-59410-7_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59409-1
Online ISBN: 978-3-030-59410-7
eBook Packages: Mathematics and Statistics (R0)