Abstract
In Entity Resolution (ER), the growing volume of unstructured records poses a challenge to traditional similarity-based approaches, because existing similarity metrics are designed for structured records. Given that similarity is hard to measure for unstructured records, can pairwise matching be done without a similarity measure? To answer this question, this research uses deep learning to learn the underlying record-matching pattern directly, rather than first measuring the similarity between records and then making a linking decision based on that measure. On the representation side, word embeddings take token order into account, whereas Bag-of-Words representations (Count and TF-IDF) do not; on the model side, the multilayer perceptron (MLP), convolutional neural network (CNN), and long short-term memory (LSTM) are examined. Our experiments on both synthetic and real-world data demonstrate that, surprisingly, the simplest representation (Count) combined with the simplest model (MLP) achieves the best results in both effectiveness and efficiency. An F-measure as high as 1.00 on the pairwise matching task suggests potential for applying deep learning to other ER tasks, such as blocking.
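To make the Bag-of-Words representations mentioned above concrete, the following is a minimal sketch of Count and TF-IDF vectors for unstructured records, showing that both discard token order. The sample records, the helper names, the smoothed IDF formula, and the idea of encoding a record pair by concatenating two Count vectors are all illustrative assumptions; the abstract does not say how the chapter builds the input to its models.

```python
from collections import Counter
import math


def tokenize(record):
    """Lowercase an unstructured record and split it on whitespace."""
    return record.lower().split()


def build_vocab(records):
    """Map every distinct token to a fixed column index."""
    tokens = sorted({tok for rec in records for tok in tokenize(rec)})
    return {tok: i for i, tok in enumerate(tokens)}


def count_vector(record, vocab):
    """Bag-of-Words Count vector: token frequencies, token order discarded."""
    vec = [0] * len(vocab)
    for tok, freq in Counter(tokenize(record)).items():
        if tok in vocab:
            vec[vocab[tok]] = freq
    return vec


def tfidf_vector(record, vocab, records):
    """TF-IDF vector, assuming a smoothed idf = ln((1 + N) / (1 + df)) + 1."""
    n = len(records)
    token_sets = [set(tokenize(r)) for r in records]
    vec = count_vector(record, vocab)
    for tok, i in vocab.items():
        df = sum(tok in s for s in token_sets)
        vec[i] = vec[i] * (math.log((1 + n) / (1 + df)) + 1)
    return vec


def pair_features(a, b, vocab):
    """Hypothetical pair encoding: concatenate the two Count vectors."""
    return count_vector(a, vocab) + count_vector(b, vocab)


# Made-up records for illustration only; the first two differ only in token order.
records = [
    "John Smith 123 Main St Little Rock AR",
    "Smith John Main St 123 Little Rock AR",
    "Mary Jones 45 Oak Ave Conway AR",
]
vocab = build_vocab(records)
v0 = count_vector(records[0], vocab)
v1 = count_vector(records[1], vocab)
same_counts = v0 == v1  # reordered tokens give an identical Count vector
pair = pair_features(records[0], records[2], vocab)
print(same_counts, len(vocab), len(pair))
```

A vector such as `pair` could then be passed to a classifier (for example an MLP) that predicts match or non-match directly, without an explicit similarity score being computed first.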
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Li, X., Talburt, J.R., Li, T., Liu, X. (2021). When Entity Resolution Meets Deep Learning, Is Similarity Measure Necessary?. In: Arabnia, H.R., Ferens, K., de la Fuente, D., Kozerenko, E.B., Olivas Varela, J.A., Tinetti, F.G. (eds) Advances in Artificial Intelligence and Applied Cognitive Computing. Transactions on Computational Science and Computational Intelligence. Springer, Cham. https://doi.org/10.1007/978-3-030-70296-0_10
DOI: https://doi.org/10.1007/978-3-030-70296-0_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-70295-3
Online ISBN: 978-3-030-70296-0
eBook Packages: Computer Science, Computer Science (R0)