Linking Records in Complex Context

Li, Pei; Maurino, Andrea

doi:10.1007/978-3-642-36257-6_10

Abstract

A data set contains several kinds of information that can be used for record linkage: attributes, context, relationships, and so on. In this chapter, we focus on techniques that enable record linkage in so-called complex context, which covers data sets with hierarchical relations, data sets that contain temporal information, and data sets extracted from the Web. For each method, we describe the problem to be solved and use a motivating example to illustrate the challenges and the intuitions behind the work. We then present an overview of the approaches, followed by a more detailed explanation of some key ideas, together with examples.

Author information

Authors and Affiliations

Department of Informatics, University of Zurich, Binzmuehlestrasse 14, CH-8050, Zurich, Switzerland
Pei Li
Department of Informatics, Systems and Communication, University of Milano, Bicocca, Viale Sarca 336/14, 20126, Milano, Italy
Andrea Maurino

Corresponding author

Correspondence to Pei Li.

Editor information

Editors and Affiliations

University of Queensland, Brisbane, Australia
Shazia Sadiq

About this chapter

Cite this chapter

Li, P., Maurino, A. (2013). Linking Records in Complex Context. In: Sadiq, S. (eds) Handbook of Data Quality. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36257-6_10

DOI: https://doi.org/10.1007/978-3-642-36257-6_10
Published: 13 February 2013
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-36256-9
Online ISBN: 978-3-642-36257-6
eBook Packages: Computer Science, Computer Science (R0)
