Entity Resolution

  • Living reference work entry
Encyclopedia of Machine Learning and Data Mining

Abstract

References to real-world entities are often ambiguous, most commonly across data sources but frequently within a single data source as well. Ambiguities arise for several reasons, such as incorrect data entry or multiple possible representations of the same entity. Given such a collection of ambiguous entity references, the goal of entity resolution is to discover the unique set of underlying entities and to map each reference to its corresponding entity. Resolving such ambiguities is necessary both for removing redundancy and for accurate entity-level analysis. The problem arises in many different applications and has been studied in several branches of computer science. As evidence for entity resolution, traditional approaches consider pair-wise similarity between references, and many sophisticated similarity measures have been proposed for comparing the attributes of references. The simplest solution classifies reference pairs with similarity above a threshold as referring to the same entity. More sophisticated solutions use a probabilistic framework for reasoning with the pair-wise probabilities. Recently proposed relational approaches for entity resolution use relationships between references, when available, as additional evidence. Instead of reasoning independently about each pair of references, these approaches reason collectively over related pair-wise decisions. One line of work within the relational family applies supervised or unsupervised learning with probabilistic graphical models, while another uses more scalable greedy techniques for merging references in a hyper-graph. Beyond improving entity resolution accuracy, such relational approaches yield additional knowledge in the form of relationships between the underlying entities.
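
As a rough illustration of the simplest solution mentioned in the abstract, the sketch below links every pair of references whose similarity exceeds a threshold and then groups linked references into entities. It is a minimal sketch under stated assumptions, not the method of any cited work: the names `similarity` and `resolve`, the threshold of 0.8, the use of Python's `difflib.SequenceMatcher` as the similarity measure, and the choice to take the transitive closure of matched pairs (via union-find) are all illustrative.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1].

    Assumption: a stand-in for any attribute similarity measure.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve(references: list[str], threshold: float = 0.8) -> list[int]:
    """Return one entity label per reference.

    Pairs at or above the (hypothetical) threshold are treated as matches;
    matches are merged transitively with a union-find structure.
    """
    parent = list(range(len(references)))

    def find(i: int) -> int:
        # Follow parent pointers to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Each pair is judged independently of every other pair.
    for i, j in combinations(range(len(references)), 2):
        if similarity(references[i], references[j]) >= threshold:
            parent[find(j)] = find(i)

    return [find(i) for i in range(len(references))]


if __name__ == "__main__":
    refs = ["Jon Smith", "John Smith", "Alice Wang", "john smith"]
    for ref, label in zip(refs, resolve(refs)):
        print(label, ref)
```

Because every pair is compared, this sketch takes quadratic time in the number of references, and because each decision is made in isolation it ignores any relationships between references, which is the limitation that the relational approaches described above are meant to address.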

Author information

Corresponding author

Correspondence to Indrajit Bhattacharya.

Copyright information

© 2014 Springer Science+Business Media New York

About this entry

Cite this entry

Bhattacharya, I., Getoor, L. (2014). Entity Resolution. In: Sammut, C., Webb, G. (eds) Encyclopedia of Machine Learning and Data Mining. Springer, Boston, MA. https://doi.org/10.1007/978-1-4899-7502-7_81-1

  • DOI: https://doi.org/10.1007/978-1-4899-7502-7_81-1

  • Publisher Name: Springer, Boston, MA

  • Online ISBN: 978-1-4899-7502-7

  • eBook Packages: Springer Reference Computer Sciences, Reference Module Computer Science and Engineering
