When Entity Resolution Meets Deep Learning, Is Similarity Measure Necessary?

Li, Xinming; Talburt, John R.; Li, Ting; Liu, Xiangwen

doi:10.1007/978-3-030-70296-0_10

Part of the book series: Transactions on Computational Science and Computational Intelligence (TRACOSCI)

Abstract

In Entity Resolution (ER), a growing volume of unstructured records poses a challenge to traditional similarity-based approaches, because existing similarity metrics are designed for structured records. Given that similarity is hard to measure for unstructured records, can pairwise matching be done without a similarity measure? To answer this question, this research uses deep learning to learn the underlying record-matching pattern directly, rather than first measuring record similarity and then making a linking decision based on that measure. On the representation side, word embedding takes token order into account, whereas Bag-of-Words (Count and TF-IDF) does not; on the model side, the multilayer perceptron (MLP), convolutional neural network (CNN), and long short-term memory (LSTM) are examined. Our experiments on both synthetic and real-world data demonstrate that, surprisingly, the simplest representation (Count) combined with the simplest model (MLP) gives the best results in both effectiveness and efficiency. An F-measure as high as 1.00 on the pairwise matching task shows potential for applying deep learning to other ER tasks such as blocking.
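To make the Bag-of-Words representations and the F-measure named in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The sample records, the whitespace tokenizer, the smoothed IDF variant, and the choice to encode a record pair by concatenating the two Count vectors are all assumptions made for illustration; the abstract does not specify them. The F-measure is taken here to be the balanced F1 score.

```python
import math
from collections import Counter


def tokenize(record):
    # Assumption: lowercase and split on whitespace.
    return record.lower().split()


def build_vocab(records):
    # Sorted vocabulary so every vector uses the same column order.
    return sorted({tok for rec in records for tok in tokenize(rec)})


def count_vector(record, vocab):
    # Count representation: one term frequency per vocabulary entry.
    counts = Counter(tokenize(record))
    return [counts[tok] for tok in vocab]


def tfidf_vector(record, vocab, records):
    # TF-IDF with a smoothed IDF; this is one common variant (assumption).
    n = len(records)
    counts = Counter(tokenize(record))
    token_sets = [set(tokenize(rec)) for rec in records]
    vector = []
    for tok in vocab:
        df = sum(1 for toks in token_sets if tok in toks)
        idf = math.log((1 + n) / (1 + df)) + 1
        vector.append(counts[tok] * idf)
    return vector


def pair_features(record_a, record_b, vocab):
    # Hypothetical pair encoding: concatenate the two Count vectors so a
    # classifier such as an MLP could consume the pair as a single input.
    return count_vector(record_a, vocab) + count_vector(record_b, vocab)


def f_measure(tp, fp, fn):
    # Balanced F-measure (F1) from true/false positives and false negatives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up unstructured records; the first two differ only in token order.
records = [
    "john r talburt little rock",
    "talburt john r little rock",
    "ting li little rock",
]
vocab = build_vocab(records)
same_under_count = count_vector(records[0], vocab) == count_vector(records[1], vocab)
pair_input = pair_features(records[0], records[2], vocab)
```

In this sketch the first two records receive identical Count vectors, which illustrates the abstract's point that Bag-of-Words does not consider token order. A sequence of word embeddings would keep the two records distinct.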

Author information

Authors and Affiliations

Information Science, University of Arkansas at Little Rock, Little Rock, AR, USA
Xinming Li, John R. Talburt, Ting Li & Xiangwen Liu

Corresponding author

Correspondence to Xinming Li.

Editor information

Editors and Affiliations

Department of Computer Science, University of Georgia, Athens, GA, USA
Hamid R. Arabnia
Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, MB, Canada
Ken Ferens
Business Administration, University of Oviedo, Oviedo, Asturias, Spain
David de la Fuente
Institute of Informatics Problems, The Russian Academy of Sciences, Moscow, Russia
Elena B. Kozerenko
Technology and Information systems, Universidad de Castilla La Mancha, Ciudad Real, Ciudad Real, Spain
José Angel Olivas Varela
Facultad de Informática - CIC PBA, Universidad Nacional de La Plata, La Plata, Argentina
Fernando G. Tinetti

About this paper

Cite this paper

Li, X., Talburt, J.R., Li, T., Liu, X. (2021). When Entity Resolution Meets Deep Learning, Is Similarity Measure Necessary?. In: Arabnia, H.R., Ferens, K., de la Fuente, D., Kozerenko, E.B., Olivas Varela, J.A., Tinetti, F.G. (eds) Advances in Artificial Intelligence and Applied Cognitive Computing. Transactions on Computational Science and Computational Intelligence. Springer, Cham. https://doi.org/10.1007/978-3-030-70296-0_10

DOI: https://doi.org/10.1007/978-3-030-70296-0_10
Published: 15 October 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-70295-3
Online ISBN: 978-3-030-70296-0
eBook Packages: Computer Science (R0)
