Skip to main content
Log in

A genetic algorithm based entity resolution approach with active learning

  • Research Article
  • Published:
Frontiers of Computer Science Aims and scope Submit manuscript

Abstract

Entity resolution is a key aspect in data quality and data integration, identifying which records correspond to the same real world entity in data sources. Many existing approaches require manually designed match rules to solve the problem, which always needs domain knowledge and is time consuming. We propose a novel genetic algorithm based entity resolution approach via active learning. It is able to learn effective match rules by logically combining several different attributes’ comparisons with proper thresholds. We use active learning to reduce manually labeled data and speed up the learning process. The extensive evaluation shows that the proposed approach outperforms the sate-of-the-art entity resolution approaches in accuracy.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Similar content being viewed by others

References

  1. Chen J C, Chen Y G, Du X Y, Li C P, Lu J H, Zhao S Y, Zhou X. Big data challenge: a data management perspective. Frontiers of Computer Science, 2013, 7(2): 157–164

    Article  MathSciNet  Google Scholar 

  2. Fellegi I P, Sunter A B. A theory for record linkage. Journal of the American Statistical Association, 1969, 64(328): 1183–1210

    Article  MATH  Google Scholar 

  3. Elmagarmid A K, Ipeirotis P G, Verykios V S. Duplicate record detection: a survey. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(1): 1–16

    Article  Google Scholar 

  4. Kopcke H, Rahm E. Frameworks for entity matching: a comparison. Data and Knowledge Engineering, 2010, 69(2): 197–210

    Article  Google Scholar 

  5. Sarawagi S, Bhamidipaty A. Interactive deduplication using active learning. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2002, 269–278

    Google Scholar 

  6. Monge A E, Elkan C. The field matching problem: algorithms and applications. In: Proceedings of the 2nd ACM SIGKDD International Conference on Knowledge Discovery and DataMining. 1996, 267–270

    Google Scholar 

  7. Pinheiro J C, Sun D X. Methods for linking and mining massive heterogeneous databases. In: Proceedings of the 4th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1998, 309–313

    Google Scholar 

  8. Bilenko M, Mooney R J. Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2003, 39–48

    Google Scholar 

  9. Minton S N, Nanjo C, Knoblock C A, Michalowski M, Michelson M. A heterogeneous field matching method for record linkage. In: Proceedings of the 5th IEEE International Conference on Data Mining. 2005, 8

    Google Scholar 

  10. Sun C, Shen D, Kou Y, Nie T, Yu G. ERGP: A Combined Entity Resolution Approach with Genetic Programming. In: Proceedings of the 11th IEEE Web Information System and Application Conference. 2014, 215–220

    Google Scholar 

  11. Li P, Dong X L, Maurino A, Srivastava D. Linking temporal records. Frontiers of Computer Science, 2012, 6(3): 293–312

    MathSciNet  MATH  Google Scholar 

  12. Sun C C, Shen D R, Kou Y, Nie T Z, Yu G. GB-JER: A Graph-Based Model for Joint Entity Resolution. In: Proceedings of the 20th International Conference on Database Systems for Advanced Applications. 2015, 458–473

    Google Scholar 

  13. Tejada S, Knoblock C A, Minton S. Learning object identification rules for information integration. Information Systems, 2001, 26(8): 607–633

    Article  MATH  Google Scholar 

  14. Winkler W E. Methods for record linkage and bayesian networks. Technical report, Series RRS2002/05. 2002

    Google Scholar 

  15. De Carvalho M G, Gonçalves M A, Laender A H F, Da Silva A S. Learning to deduplicate. In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries. 2006, 41–50

    Chapter  Google Scholar 

  16. De Carvalho MG, Laender A H, Gonçalves MA, Da Silva A S. Replica identification using genetic programming. In: Proceedings of the 2008 ACM Symposium on Applied Computing. 2008, 1801–1806

    Chapter  Google Scholar 

  17. De Carvalho M G, Laender A H, Gonçalves M A, Da Silva A S. A genetic programming approach to record deduplication. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(3): 399–412

    Article  Google Scholar 

  18. Isele R, Bizer C. Learning linkage rules using genetic programming. In: Proceedings of the 6th International Workshop on Ontology Matching. 2011, 13–24

    Google Scholar 

  19. Isele R, Bizer C. Learning expressive linkage rules using genetic programming. The Very Large Databases Endowment, 2012, 5(11): 1638–1649

    Google Scholar 

  20. Banzhaf W, Nordin P, Keller R E, Francone F D. Genetic programming: an introduction. San Francisco: Morgan Kaufmann, 1998

    Book  MATH  Google Scholar 

  21. Poli R, Langdon W B, McPhee N F, Koza J R. A field guide to genetic programming. Lulu.com, 2008

    Google Scholar 

  22. Liere R, Tadepalli P. Active learning with committees for text categorization. In: Proceedings of the 14th National Conference on Artificial Intelligence. 1997, 591–596

    Google Scholar 

  23. Cohn D, Atlas L, Ladner R. Improving generalization with active learning. Machine Learning, 1994, 15(2): 201–221

    Google Scholar 

  24. Bellare K, Suresh I, Parameswaran A, Rastogi V. Active sampling for entity matching with guarantees. ACM Transactions on Knowledge Discovery from Data, 2013, 7(3): 12

    Article  Google Scholar 

  25. Arasu A, Gotz M, Kaushik R. On active learning of record matching packages. In: Proceedings of the 2010 International Conference on Management of Data. 2010, 783–794

    Google Scholar 

  26. Koza J R. Genetic Programming: on the Programming of Computers by Means of Natural Selection. Boston: MIT Press, 1992

    MATH  Google Scholar 

  27. Cohen W, Ravikumar P, Fienberg S. A comparison of string metrics for matching names and records. In: Proceedings of the 9th ACM SIGKDDWorkshop on Data Cleaning and Object Consolidation. 2003, 73–78

    Google Scholar 

  28. Baeza-Yates R, Ribeiro-Neto B. Modern information retrieval. New York: ACM Press, 1999

    Google Scholar 

  29. Blickle T, Thiele L. A comparison of selection schemes used in genetic algorithms. TIK-Report No.11, 1995

    Google Scholar 

  30. Monge A E, Elkan C P. An efficient domain-independent algorithm for detecting approximately duplicate database records. In: Proceedings of the 2nd ACM SIGMOD Workshop Research Issues in Data Mining and Knowledge Discovery. 1997, 23–29

    Google Scholar 

  31. Shannon C E. A mathematical theory of communication. Bell System Technical Journal, 1948, 27(3): 379–423

    Article  MathSciNet  MATH  Google Scholar 

  32. Hassanzadeh O, Chiang F, Lee H C, Miller R J. Framework for evaluating clustering algorithms in duplicate detection. The Very Large Databases Endowment, 2009, 2(1): 1282–1293

    Google Scholar 

Download references

Acknowledgements

The authors thank anonymous reviewers for their inspiring doubts and helpful suggestions during the reviewing process. This work was supported by the National Basic Research Program of China (973 Program) (2012CB316201), the Fundamental Research Funds for the Central Universities (N120816001) and the National Natural Science Foundation of China (Grant Nos. 61472070, 61402213).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Chenchen Sun.

Additional information

This work is an extended version of the 11th IEEE Web Information System and Application Conference paper

Chenchen Sun is a PhD candidate in the College of Information Science and Engineering, Northeastern University, China. He received BS and MS from the same university in 2010 and 2012, respectively. His research interest is entity resolution.

Derong Shen is a professor and PhD supervisor in the College of Information Science and Engineering, Northeastern University, China, from where she received her PhD in 2004. She received her BS and MS from Jilin University, China in 1987 and 1990, respectively. Her interests include distributed data management and data integration.

Yue Kou is an associate professor in the College of Information Science and Engineering, Northeastern University, China, from where she also received her BS, MS, and PhD in 2002, 2005, and 2009, respectively. Her interests include entity search and data mining.

Tiezheng Nie is an associate professor in the College of Information Science and Engineering, Northeastern University, China, from where he received his BS, MS, and PhD in 2002, 2005, and 2009, respectively. His interests include data quality and data integration.

Ge Yu is a professor and PhD supervisor in the College of Information Science and Engineering, Northeastern University, China, from where he received his BS and MS in 1982 and 1985, respectively. He received his PhD from Kyushu University of Japan, Japan in 1996. He is a senior member of the CCF, and a member of the ACM, IEEE. His interests include databases and big data management.

Electronic supplementary material

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Sun, C., Shen, D., Kou, Y. et al. A genetic algorithm based entity resolution approach with active learning. Front. Comput. Sci. 11, 147–159 (2017). https://doi.org/10.1007/s11704-015-5276-6

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11704-015-5276-6

Keywords

Navigation