Abstract
As companies gather and process more data from disparate sources, they rely more heavily on entity resolution. Currently, building an entity resolution system is a highly procedural process: blocking, transitive closure, and matching must all be pieced together, whether by an Extract, Transform, and Load (ETL) tool or by a custom program (Galhardas et al. 2000). This is similar to the state of data querying before the advent of the Structured Query Language (SQL). This chapter presents a declarative approach to entity resolution that lets the user specify what he or she would like resolved while allowing a code generator to determine the best way to resolve it. The chapter does not explore algorithms for blocking, transitive closure, clustering, or matching; for those it refers to papers by other authors (Baxter et al. 2003; Gu and Baxter 2004; Winkler 2000, 2003; Jaro 1989; Bhattacharya and Getoor 2006). Rather, it gives background on, and a defense of, entity resolution and declarative languages, together with a declarative solution and a possible representation.
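The contrast between declaring what to resolve and coding how to resolve it can be illustrated with a minimal Python sketch. It is not the chapter's language or representation: `ResolveSpec`, `resolve`, and the record fields (`zip`, `email`, `phone`) are hypothetical names chosen for illustration, and the "generator" here is a single fixed plan (exact-key blocking, any-field-equal matching, union-find transitive closure) standing in for whatever plan a real code generator might choose.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class ResolveSpec:
    """Hypothetical declarative spec: states WHAT to resolve, not HOW."""
    block_on: str      # records sharing this field's value become candidates
    match_any: tuple   # two candidates match if ANY of these fields agree


def resolve(records, spec):
    """Assumed stand-in for a generated plan: block, match, close transitively."""
    # Blocking: only records with the same blocking key are compared.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[rec[spec.block_on]].append(idx)

    # Union-find structure used to compute the transitive closure of matches.
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    # Matching: pairwise comparison inside each block.
    for members in blocks.values():
        for i, j in combinations(members, 2):
            if any(records[i][f] and records[i][f] == records[j][f]
                   for f in spec.match_any):
                union(i, j)

    # Group record indices by their representative.
    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())


# Made-up sample data for illustration only.
RECORDS = [
    {"zip": "72201", "email": "a@x.com", "phone": "555-0100"},
    {"zip": "72201", "email": "a@x.com", "phone": "555-0199"},
    {"zip": "72201", "email": "b@x.com", "phone": "555-0199"},
    {"zip": "72201", "email": "c@x.com", "phone": "555-0000"},
    {"zip": "10001", "email": "c@x.com", "phone": "555-0000"},
]
SPEC = ResolveSpec(block_on="zip", match_any=("email", "phone"))

if __name__ == "__main__":
    print(resolve(RECORDS, SPEC))
```

In this sketch records 0 and 2 share no field directly but land in the same cluster through record 1, which is the role transitive closure plays. Records 3 and 4 agree on every matched field yet stay apart because blocking never compares them, a trade-off that a declarative system would leave to the generator rather than to the user.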
References
Baxter R, Christen P, Churches T (2003) A Comparison of Fast Blocking Methods for Record Linkage. Proceedings of the ACM SIGKDD’03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation.
Benjelloun O, et al. (2006a) Swoosh: A Generic Approach to Entity Resolution. Stanford University Technical Report.
Benjelloun O, et al. (2006b) DSwoosh: A Family of Algorithms for Generic, Distributed Entity Resolution. Stanford University Technical Report.
Bhattacharya I, Getoor L (2005) Entity Resolution in Graphs. University of Maryland Technical Report CS-TR-4758.
Bhattacharya I, Getoor L (2006) Collective Entity Resolution in Relational Data. IEEE Data Engineering Bulletin, Special Issue on Data Quality, June 2006.
Cohen W, Richman J (2002) Learning to Match and Cluster Large High-Dimensional Data Sets for Data Integration. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Fellegi I, Sunter A (1969) A Theory for Record Linkage. Journal of the American Statistical Association.
Galhardas H et al. (2000) An Extensible Framework for Data Cleansing. International Conference on Data Engineering.
Gu L, Baxter R (2004) Adaptive Filtering for Efficient Record Linkage. Proceedings of the Fourth SIAM International Conference on Data Mining.
Hernandez M, Stolfo S (1995) The Merge/Purge Problem for Large Databases. Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data.
Jaro M (1989) Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association.
Jin L, Li C, Mehrotra S (2003) Efficient Record Linkage in Large Data Sets. Proceedings of the 8th International Conference on Database Systems for Advanced Applications.
McCallum A, Nigam K, Ungar L (2000) Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching. Knowledge Discovery and Data Mining.
Navarro G (2001) A Guided Tour to Approximate String Matching. ACM Computing Surveys.
Singla P, Domingos P (2006) Entity Resolution with Markov Logic. Proceedings of the Sixth International Conference on Data Mining.
Spring (2008) retrieved 2008 from http://www.springframework.org.
Winkler W (1988) Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage. Proceedings of the Section on Survey Research Methods, American Statistical Association.
Winkler W (1990) String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage. Proceedings of the Section on Survey Research Methods, American Statistical Association.
Winkler W (2000) Machine Learning, Information Retrieval, and Record Linkage. Proceedings of the Section on Survey Research Methods, American Statistical Association.
Winkler W (2003) Data Cleaning Methods. Proceedings of the ACM Workshop on Data Cleaning, Record Linkage, and Object Identification.
Copyright information
© 2009 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Gibbs, T.H. (2009). A Declarative Approach to Entity Resolution. In: Chan, Y., Talburt, J., Talley, T. (eds) Data Engineering. International Series in Operations Research & Management Science, vol 132. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-0176-7_2
DOI: https://doi.org/10.1007/978-1-4419-0176-7_2
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4419-0175-0
Online ISBN: 978-1-4419-0176-7
eBook Packages: Computer Science, Computer Science (R0)