GAEMTBD: Genetic algorithm based entity matching techniques for bibliographic databases

Applied Intelligence

Abstract

Entity matching maps the records in a database to their corresponding entities. It is a well-known problem in the fields of databases and artificial intelligence. In digital libraries such as DBLP, ArnetMiner, Google Scholar, Scopus, Web of Science, AllMusic, and IMDB, some attributes evolve over time, i.e., they take different values at different instants. For example, in bibliographic databases such as DBLP and ArnetMiner, which maintain the publication details of authors, an author's affiliation and email address may change. Likewise, a taxpayer may change address over time, and people sometimes change their surnames upon marriage. When a database contains records of this nature and the number of records grows large, identifying which records belong to which entity becomes challenging because no proper key is available. In this paper, the automatic partitioning of records is posed as an optimization problem, and a genetic algorithm based technique is proposed to solve the entity matching problem. The proposed approach automatically determines the number of partitions in a bibliographic dataset. A comparative analysis against two existing systems, DBLP and ArnetMiner, over sixteen bibliographic datasets demonstrates the efficacy of the proposed approach.
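
To make the idea of partitioning records by optimization concrete, the following is a minimal sketch and not the method of the paper, whose encoding, objective, and operators are not given in this abstract. Everything here is an assumption chosen for illustration: records are reduced to sets of co-author names, similarity is the Jaccard coefficient, the objective is a simple thresholded pairwise sum, and the names `jaccard`, `fitness`, `evolve`, and the toy `records` are hypothetical. Each chromosome assigns one label per record, so the number of partitions is whatever number of distinct labels the best chromosome ends up using.

```python
import random
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def fitness(labels, records, threshold=0.3):
    """Assumed objective: sum of (similarity - threshold) over same-label pairs."""
    return sum(
        jaccard(records[i], records[j]) - threshold
        for i, j in combinations(range(len(records)), 2)
        if labels[i] == labels[j]
    )


def evolve(records, pop_size=30, generations=60, mutation_rate=0.1, seed=0):
    """Evolve label vectors; returns (best labels, partition count, fitness)."""
    rng = random.Random(seed)
    n = len(records)

    def score(chrom):
        return fitness(chrom, records)

    # One label per record; the distinct labels in a chromosome are its partitions.
    population = [[rng.randrange(n) for _ in range(n)] for _ in range(pop_size - 1)]
    population.append(list(range(n)))  # baseline: every record in its own partition

    for _ in range(generations):
        population.sort(key=score, reverse=True)
        next_gen = [population[0][:]]  # elitism: carry the best chromosome over
        while len(next_gen) < pop_size:
            # Binary tournament selection of two parents.
            p1 = max(rng.sample(population, 2), key=score)
            p2 = max(rng.sample(population, 2), key=score)
            # Uniform crossover, then random relabelling as mutation.
            child = [g1 if rng.random() < 0.5 else g2 for g1, g2 in zip(p1, p2)]
            child = [rng.randrange(n) if rng.random() < mutation_rate else g
                     for g in child]
            next_gen.append(child)
        population = next_gen

    best = max(population, key=score)
    return best, len(set(best)), score(best)


# Hypothetical toy data: each record is the co-author set of one publication.
records = [
    {"a", "b", "c"}, {"a", "b"}, {"b", "c"},
    {"x", "y", "z"}, {"x", "y"}, {"y", "z"},
]

if __name__ == "__main__":
    labels, k, value = evolve(records)
    print(labels, k, value)
```

Because the all-singletons partition is seeded into the initial population and the best chromosome is always carried over, the returned fitness can never fall below that baseline. A label-vector encoding is used only because it is short to write; it lets many different chromosomes describe the same partition, which a more careful design would address.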

Notes

  1. http://www.informatik.uni-trier.de/~ley/db/

  2. http://arnetminer.org/

Author information

Corresponding author

Correspondence to Sumit Mishra.

Cite this article

Mishra, S., Saha, S. & Mondal, S. GAEMTBD: Genetic algorithm based entity matching techniques for bibliographic databases. Appl Intell 47, 197–230 (2017). https://doi.org/10.1007/s10489-016-0874-z

  • DOI: https://doi.org/10.1007/s10489-016-0874-z
