Linking Records in Complex Context

  • Chapter
Handbook of Data Quality

Abstract

A data set contains different kinds of information that can be used for record linkage: attributes, context, relationships, and so on. In this chapter, we focus on techniques that enable record linkage in so-called complex context, which includes data sets with hierarchical relations, data sets that contain temporal information, and data sets that are extracted from the Web. For each method, we describe the problem to be solved and use a motivating example to demonstrate the challenges and intuitions of the work. We then present an overview of the approaches, followed by a more detailed explanation of some key ideas, together with examples.

Author information

Corresponding author

Correspondence to Pei Li.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Li, P., Maurino, A. (2013). Linking Records in Complex Context. In: Sadiq, S. (eds) Handbook of Data Quality. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36257-6_10

  • DOI: https://doi.org/10.1007/978-3-642-36257-6_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-36256-9

  • Online ISBN: 978-3-642-36257-6

  • eBook Packages: Computer Science, Computer Science (R0)
