Abstract
Entity resolution (ER), the task of identifying different representations of the same real-world entity, is critical for data integration. The ER problem has been studied for many years, and many methods have been proposed to solve it. Although deep learning has achieved good performance on ER tasks, it still faces challenges in manual labeling and model transfer. This paper proposes a novel ER model, Transformer-based Denoising Adversarial Variational Entity Resolution (TdavER). For entity embedding, we develop an unsupervised model based on denoising autoencoders and pre-trained language models; it is trained on corrupted input, which pushes the encoder to generate stable, robust, high-quality entity representations. Furthermore, we propose an unsupervised entity feature transformation model based on adversarial variational autoencoders to loosen the constraints that the training data place on entity representations. This transformation model converts low-level entity embeddings into high-level probability distributions, which are not constrained by the source data and contain deep similarity features. To better implement the feature transformation, we adopt adversarial networks to optimize the variational autoencoder’s training process and help it learn the correct posterior distribution. Extensive experiments confirm that the performance of TdavER is comparable with that of current state-of-the-art ER methods and that its entity feature transformation model is transferable.
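To make the idea of training on corrupted input concrete, the following is a minimal sketch of how (noisy input, clean target) pairs for a denoising autoencoder might be built from entity records. The abstract does not state which corruption TdavER uses; random token deletion, the deletion ratio of 0.6, and the helper names `serialize_record`, `corrupt_tokens`, and `make_training_pair` are all assumptions made for illustration.

```python
import random


def serialize_record(record):
    """Flatten an attribute dict into a list of whitespace-separated tokens.

    Hypothetical helper: the paper's actual serialization is not given here.
    """
    tokens = []
    for value in record.values():
        tokens.extend(str(value).split())
    return tokens


def corrupt_tokens(tokens, delete_ratio=0.6, rng=None):
    """Delete each token independently with probability `delete_ratio`.

    Assumed noise type (token deletion); at least one token is always kept
    so the encoder never receives an empty input.
    """
    if not tokens:
        return []
    rng = rng or random.Random()
    kept = [tok for tok in tokens if rng.random() >= delete_ratio]
    if not kept:
        kept = [rng.choice(tokens)]
    return kept


def make_training_pair(record, delete_ratio=0.6, rng=None):
    """Return (noisy_input, clean_target) for one entity record."""
    clean = serialize_record(record)
    noisy = corrupt_tokens(clean, delete_ratio, rng)
    return noisy, clean


if __name__ == "__main__":
    example = {"title": "iPhone 13 Pro 128GB", "brand": "Apple"}
    noisy, clean = make_training_pair(example, rng=random.Random(0))
    print("clean:", clean)
    print("noisy:", noisy)
```

In a full pipeline the encoder would read the noisy token sequence and the decoder would be trained to reconstruct the clean one, so that the resulting embedding has to capture the content of the entity rather than its exact surface form.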
Data Availability
Source code and data for MyRDF are available from GitHub (footnote 3). Source code for the comparison methods is available from GitHub (footnotes 4 and 5) and from a website (footnote 6).
Notes
Public datasets, together with their training/test instances, are available at www.github.com/anhaidgroup/deepmatcher/blob/master/Datasets.md
All experiments were run using TensorFlow 1.15.0 on a Python 3 Linux server with 128 GB RAM and GPU acceleration.
Acknowledgements
The authors would like to thank the faculty members involved in the Big Data Lab for their comments and observations on the manuscript.
Funding
This work is supported by the National Natural Science Foundation of China (61672470), Major Public Welfare Projects in Henan Province, China (201300210200) and Key Scientific Research of Colleges and Universities in Henan Province (22B520047).
Author information
Contributions
Shuaichao Li is primarily responsible for the experimental implementation and for writing the manuscript. Huaiguang Wu is mainly responsible for the architectural design and for reviewing the content of the manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
About this article
Cite this article
Li, S., Wu, H. Transformer-based Denoising Adversarial Variational Entity Resolution. J Intell Inf Syst 61, 631–650 (2023). https://doi.org/10.1007/s10844-022-00773-x
DOI: https://doi.org/10.1007/s10844-022-00773-x