Abstract
A data set contains several kinds of information that can be used for record linkage: attributes, context, relationships, and so on. In this chapter, we focus on techniques that enable record linkage in so-called complex contexts, which include data sets with hierarchical relations, data sets that contain temporal information, and data sets extracted from the Web. For each method, we describe the problem to be solved and use a motivating example to demonstrate the challenges and the intuitions behind the work. We then present an overview of the approaches, followed by a more detailed explanation of some key ideas, together with examples.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Li, P., Maurino, A. (2013). Linking Records in Complex Context. In: Sadiq, S. (eds) Handbook of Data Quality. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36257-6_10
DOI: https://doi.org/10.1007/978-3-642-36257-6_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-36256-9
Online ISBN: 978-3-642-36257-6
eBook Packages: Computer Science (R0)