EnAli: entity alignment across multiple heterogeneous data sources
Abstract
Entity alignment is the problem of identifying which entities in one data source refer to the same real-world entity as those in other sources. Identifying entities across heterogeneous data sources is essential to many research fields, such as data cleaning, data integration, information retrieval and machine learning. The alignment process is not only prohibitively expensive for large data sources, since it involves all tuples from two or more data sources, but also needs to handle heterogeneous entity attributes. In this paper, we propose an unsupervised approach, called EnAli, to match entities across two or more heterogeneous data sources. EnAli employs a generative probabilistic model that incorporates heterogeneous entity attributes through the exponential family, handles missing values, and uses a locality sensitive hashing scheme to reduce the number of candidate tuples and speed up the alignment process. EnAli is highly accurate and efficient even without any ground-truth tuples. We illustrate the performance of EnAli on re-identifying entities from the same data source, as well as on aligning entities across three real data sources. Our experimental results show that the proposed approach outperforms the comparable baseline.
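To illustrate how locality sensitive hashing can shrink the set of candidate tuples before any pairwise comparison, the following is a minimal MinHash sketch. It is not EnAli's implementation: the function names, the character 3-gram tokenization, and the parameters `num_hashes` and `bands` are assumptions chosen for illustration only.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of character k-grams of a lowercased string."""
    s = text.lower()
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}


def minhash_signature(items, num_hashes=32):
    """For each seeded hash function, keep the minimum hash value over the set."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{x}".encode(), digest_size=8).digest(),
                "big",
            )
            for x in items
        ))
    return signature


def lsh_candidate_pairs(records, num_hashes=32, bands=16):
    """Bucket records by signature band; records sharing a bucket become candidates.

    `records` maps a record id to one string attribute (hypothetical layout).
    """
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for rec_id, text in records.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(rec_id)
    pairs = set()
    for ids in buckets.values():
        for left, right in combinations(sorted(ids), 2):
            pairs.add((left, right))
    return pairs


# Toy records standing in for tuples from different sources.
records = {
    "a1": "Jonathan Smith",
    "b1": "Jonathan Smith",
    "c1": "xyzzy qwerty",
}
candidates = lsh_candidate_pairs(records)
```

Records with identical strings always share every bucket, while records with no shingle in common would only collide through a hash collision, so only the surviving pairs need to be scored by a matching model. More bands with fewer rows each make the filter more permissive; fewer bands make it stricter.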
Keywords
entity alignment, exponential family, locality sensitive hashing, EM-algorithm
Notes
Acknowledgements
This work has been supported by the National Key Research and Development Program of China (2016YFB1000905), the National Natural Science Foundation of China (Grant Nos. U1401256, 61402177, 61672234, 61402180 and 61232002). This work was also supported by NSF of Shanghai (14ZR1412600).