Transformer-based Denoising Adversarial Variational Entity Resolution

Li, Shuaichao; Wu, Huaiguang

doi:10.1007/s10844-022-00773-x

Published: 17 April 2023

Volume 61, pages 631–650, (2023)
Journal of Intelligent Information Systems

Shuaichao Li¹ &
Huaiguang Wu¹

Abstract

Entity resolution (ER), the task of precisely identifying different representations of the same real-world entity, is critical for data integration. ER has been studied for many years, and many methods have been proposed to solve it. Although deep learning performs well on ER tasks, it still faces challenges in manual labeling and model transfer. This paper proposes a novel ER model, Transformer-based Denoising Adversarial Variational Entity Resolution (TdavER). For entity embedding, we develop an unsupervised model based on denoising autoencoders and pre-trained language models; it is trained on corrupted input, which pushes the encoder to generate stable, robust, high-quality entity representations. Furthermore, we propose an unsupervised entity feature transformation model based on adversarial variational autoencoders to ease the constraints that the training data place on entity representations. This transformation model converts low-level entity embeddings into high-level probability distributions, which are not constrained by the source data and contain deep similarity features. To better implement the feature transformation, we adopt adversarial networks to optimize the variational autoencoder’s training process and help it learn the correct posterior distribution. Extensive experiments confirm that the performance of TdavER is comparable with current state-of-the-art ER methods and that its entity feature transformation model is transferable.
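The two ideas named in the abstract, training on corrupted input and mapping an embedding to a probability distribution, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how records are corrupted, so random token deletion is an assumption here, and the function names `corrupt_tokens` and `sample_latent`, the deletion probability, and the example record are all hypothetical. The sampling step uses the standard reparameterization trick from variational autoencoders.

```python
import math
import random


def corrupt_tokens(tokens, delete_prob=0.6, rng=None):
    """Return a noisy copy of `tokens` with each token deleted with
    probability `delete_prob` (assumed noise scheme, not from the paper).
    At least one token is kept whenever the input is non-empty."""
    if rng is None:
        rng = random.Random()
    kept = [t for t in tokens if rng.random() >= delete_prob]
    if not kept and tokens:
        kept = [rng.choice(tokens)]
    return kept


def sample_latent(mean, log_var, rng=None):
    """Draw z from a diagonal Gaussian N(mean, exp(log_var)) using the
    reparameterization trick: z = mean + sigma * eps, eps ~ N(0, 1)."""
    if rng is None:
        rng = random.Random()
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


# Hypothetical product record, serialized as whitespace-separated tokens.
record = "apple iphone 12 pro 128gb graphite".split()
rng = random.Random(0)

# A denoising autoencoder would be trained to reconstruct `record` from `noisy`.
noisy = corrupt_tokens(record, delete_prob=0.6, rng=rng)

# A variational encoder would output (mean, log_var) per entity; here they
# are made-up numbers used only to show the sampling step.
z = sample_latent([0.0, 1.0], [0.0, -2.0], rng=rng)
```

In a full model, the mean and log-variance would come from a learned encoder applied to the entity embedding, and entity pairs would be compared through their distributions rather than through the raw embeddings.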

Data Availability

Source code and data for MyRDF are available from GitHub (Note 3). Source code for the comparison methods is available from GitHub (Notes 4 and 5) and from a project website (Note 6).

Notes

1. Public datasets, together with their training/test instances, are available at www.github.com/anhaidgroup/deepmatcher/blob/master/Datasets.md
2. All experiments were run using TensorFlow 1.15.0 on a Python 3 Linux server with 128 GB RAM and GPU acceleration.
3. https://github.com/LSC-zzuli/TdavER-noval-ER-model
4. https://github.com/anhaidgroup/deepmatcher
5. https://github.com/megagonlabs/ditto
6. https://sites.google.com/site/anhaidgroup/current-projects/magellan

Acknowledgements

The authors would like to thank the faculty members involved in the Big Data Lab for their comments and observations on the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (61672470), Major Public Welfare Projects in Henan Province, China (201300210200), and Key Scientific Research of Colleges and Universities in Henan Province (22B520047).

Author information

Authors and Affiliations

College of Computer and Communication Engineering, Zhengzhou University of Light Industry, Kexue Avenue, Zhengzhou, 450066, Henan Province, China
Shuaichao Li & Huaiguang Wu

Contributions

Shuaichao Li is primarily responsible for the experimental implementation and for writing the manuscript. Huaiguang Wu is mainly responsible for the architectural design and for reviewing the manuscript.

Corresponding author

Correspondence to Huaiguang Wu.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Li, S., Wu, H. Transformer-based Denoising Adversarial Variational Entity Resolution. J Intell Inf Syst 61, 631–650 (2023). https://doi.org/10.1007/s10844-022-00773-x

Received: 13 October 2022
Revised: 16 December 2022
Accepted: 19 December 2022
Published: 17 April 2023
Issue Date: October 2023
DOI: https://doi.org/10.1007/s10844-022-00773-x
