A genetic algorithm based entity resolution approach with active learning

Sun, Chenchen; Shen, Derong; Kou, Yue; Nie, Tiezheng; Yu, Ge

doi:10.1007/s11704-015-5276-6

Research Article
Published: 07 April 2017

Volume 11, pages 147–159, (2017)

Frontiers of Computer Science

Chenchen Sun¹, Derong Shen¹, Yue Kou¹, Tiezheng Nie¹ & Ge Yu¹

Abstract

Entity resolution, a key aspect of data quality and data integration, identifies which records in data sources correspond to the same real-world entity. Many existing approaches require manually designed match rules, which demand domain knowledge and are time-consuming to build. We propose a novel genetic algorithm based entity resolution approach with active learning. It learns effective match rules by logically combining comparisons of several different attributes with proper thresholds. Active learning reduces the amount of manually labeled data and speeds up the learning process. An extensive evaluation shows that the proposed approach outperforms state-of-the-art entity resolution approaches in accuracy.
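
To make the idea of a match rule concrete, the following is a minimal sketch and not the authors' implementation. It assumes a rule is a small tree whose leaves compare one attribute against a threshold and whose inner nodes are logical AND/OR. The similarity measure (Python's `difflib.SequenceMatcher`), the tuple encoding, the attribute names, and the `disagreement` helper are all illustrative assumptions. The helper shows one common way an active learner might pick record pairs for manual labeling: pairs on which candidate rules disagree most.

```python
from difflib import SequenceMatcher

# Assumed encoding (hypothetical, for illustration only):
#   leaf:  ("cmp", attribute_name, threshold)
#   node:  ("and" | "or", left_rule, right_rule)

def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]; a stand-in measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def matches(rule, r1: dict, r2: dict) -> bool:
    """Evaluate a match rule on a pair of records."""
    op = rule[0]
    if op == "cmp":
        _, attr, threshold = rule
        return similarity(r1[attr], r2[attr]) >= threshold
    _, left, right = rule
    if op == "and":
        return matches(left, r1, r2) and matches(right, r1, r2)
    if op == "or":
        return matches(left, r1, r2) or matches(right, r1, r2)
    raise ValueError(f"unknown operator: {op}")

def disagreement(rules, r1: dict, r2: dict) -> float:
    """Share of minority votes among candidate rules (0.5 = maximal disagreement)."""
    votes = sum(matches(rule, r1, r2) for rule in rules)
    return min(votes, len(rules) - votes) / len(rules)

# Hypothetical rule: titles must be similar, and either authors or year must agree.
rule = ("and",
        ("cmp", "title", 0.8),
        ("or", ("cmp", "authors", 0.7), ("cmp", "year", 1.0)))

a = {"title": "Entity Resolution", "authors": "Sun C", "year": "2017"}
b = {"title": "entity resolution", "authors": "C. Sun", "year": "2017"}
c = {"title": "Knowledge Graphs", "authors": "Peng C", "year": "2023"}
d = {"title": "Entity Resolution", "authors": "Sun C", "year": "2016"}

print(matches(rule, a, b))  # same title and year
print(matches(rule, a, c))  # different title
# Two single-comparison rules vote differently on (a, d), so this pair
# would be a natural candidate to send to a human labeler.
print(disagreement([("cmp", "title", 0.8), ("cmp", "year", 1.0)], a, d))
```

In a genetic algorithm, trees of this kind would form the population, with crossover swapping subtrees and mutation perturbing thresholds or operators; those steps are omitted here because the abstract does not specify them.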

Acknowledgements

The authors thank the anonymous reviewers for their insightful questions and helpful suggestions during the review process. This work was supported by the National Basic Research Program of China (973 Program) (2012CB316201), the Fundamental Research Funds for the Central Universities (N120816001) and the National Natural Science Foundation of China (Grant Nos. 61472070, 61402213).

Author information

Authors and Affiliations

School of Information Science and Engineering, Northeastern University, Shenyang, 110819, China
Chenchen Sun, Derong Shen, Yue Kou, Tiezheng Nie & Ge Yu

Corresponding author

Correspondence to Chenchen Sun.

Additional information

This work is an extended version of a paper presented at the 11th IEEE Web Information System and Application Conference.

Chenchen Sun is a PhD candidate in the College of Information Science and Engineering, Northeastern University, China. He received his BS and MS from the same university in 2010 and 2012, respectively. His research interest is entity resolution.

Derong Shen is a professor and PhD supervisor in the College of Information Science and Engineering, Northeastern University, China, from where she received her PhD in 2004. She received her BS and MS from Jilin University, China in 1987 and 1990, respectively. Her interests include distributed data management and data integration.

Yue Kou is an associate professor in the College of Information Science and Engineering, Northeastern University, China, from where she also received her BS, MS, and PhD in 2002, 2005, and 2009, respectively. Her interests include entity search and data mining.

Tiezheng Nie is an associate professor in the College of Information Science and Engineering, Northeastern University, China, from where he received his BS, MS, and PhD in 2002, 2005, and 2009, respectively. His interests include data quality and data integration.

Ge Yu is a professor and PhD supervisor in the College of Information Science and Engineering, Northeastern University, China, from where he received his BS and MS in 1982 and 1985, respectively. He received his PhD from Kyushu University, Japan in 1996. He is a senior member of the CCF, and a member of the ACM and IEEE. His interests include databases and big data management.

Electronic supplementary material

Supplementary material, approximately 212 KB.

About this article

Cite this article

Sun, C., Shen, D., Kou, Y. et al. A genetic algorithm based entity resolution approach with active learning. Front. Comput. Sci. 11, 147–159 (2017). https://doi.org/10.1007/s11704-015-5276-6

Received: 06 July 2015
Accepted: 19 October 2015
Published: 07 April 2017
Issue Date: February 2017
DOI: https://doi.org/10.1007/s11704-015-5276-6
