Advertisement

Knowledge and Information Systems

, Volume 46, Issue 2, pp 285–314 | Cite as

Efficient entity resolution based on subgraph cohesion

  • Hongzhi WangEmail author
  • Jianzhong Li
  • Hong Gao
Regular Paper

Abstract

Entity resolution has wide applications and receives considerable attentions in literature. For entity resolution, similarity functions are often used to judge whether two data objects refer to the same real-world entity. However, the similar relations determined by many commonly used similarity functions lack transitivity. This fact results in the conflict that \(A\) and \(B\) refer to the same entity and \(B\) and \(C\) refer to the same entity, but \(A\) and \(C\) do not refer to the same entity. To address this problem and make the group-wise entity resolution results consistent with pairwise entity resolution, this paper models the entity resolution problem as the partition of the vertices in a weighted graph into cohesive subgraphs, which is proven to be co-NP-complete. To solve this problem, an approximate algorithm with approximation ratio bound is proposed. For performing entity resolution on a large data set efficiently, a heuristic algorithm is developed to address this problem. In order to implement the heuristic algorithm efficiently, a similarity measure compatible with many measures in common usage is presented. With such similarity measure, indices and efficient implementations for the heuristic algorithm are proposed. Extensive experiments have been performed to verify the efficiency and effectiveness of the methods in this paper.

Keywords

Entity resolution Data quality Graph cohesion 

Notes

Acknowledgments

This paper was partially supported by NGFR 973 Grant 2012CB316200, NSFC Grant 61472099,61133002 and National Sci-Tech Support Plan 2015BAH10F00.

References

  1. 1.
    Duplicate detection, record linkage, and identity uncertainty: datasets. http://www.cs.utexas.edu/users/ml/riddle/data.html. Accessed 6 Oct 2013
  2. 2.
    Matching (graph theory) (2012). http://en.wikipedia.org/matching_(graph_theory). Accessed 15 Oct 2012
  3. 3.
    DBLP (2014). http://www.informatik.uni-trier.de/~ley/db/. Accessed 15 Jan 2014
  4. 4.
    Arasu A, Ré C, Suciu D (2009) Large-scale deduplication with constraints using dedupalog. In: ICDE, pp 952–963Google Scholar
  5. 5.
    Aslam JA, Pelekhov E, Rus D (2004) The star clustering algorithm for static and dynamic information organization. J Graph Algorithms Appl 8:95–129MathSciNetCrossRefzbMATHGoogle Scholar
  6. 6.
    Bayardo RJ, Ma Y, Srikant R (2007) Scaling up all pairs similarity search. In: WWW, pp 131–140Google Scholar
  7. 7.
    Benjelloun O, Garcia-Molina H, Menestrina D, Su Q, Whang SE, Widom J (2009) Swoosh: a generic approach to entity resolution. VLDB J 18(1):255–276CrossRefGoogle Scholar
  8. 8.
    Bhattacharya I, Getoor L (2004) Iterative record linkage for cleaning and integration. In: DMKD, pp 11–18Google Scholar
  9. 9.
    Chaudhuri S, Chen B-C, Ganti V, Kaushik R (2007) Example-driven design of efficient record matching queries. In: VLDB, pp 327–338Google Scholar
  10. 10.
    Chaudhuri S, Ganti V, Kaushik R (2006) A primitive operator for similarity joins in data cleaning. In: ICDE, p 5Google Scholar
  11. 11.
    Chaudhuri S, Sarma AD, Ganti V, Kaushik R (2007) Leveraging aggregate constraints for deduplication. In: SIGMOD conference, pp 437–448Google Scholar
  12. 12.
    Chen Z, Kalashnikov DV, Mehrotra S (2009) Exploiting context analysis for combining multiple entity resolution systems. In: SIGMOD conference, pp 207–218Google Scholar
  13. 13.
    Dong X, Halevy AY, Madhavan J (2005) Reference reconciliation in complex information spaces. In: SIGMOD conference, pp 85–96Google Scholar
  14. 14.
    Duda R, Hart P, Stork D (2001) Pattern classification. Wiley, Hoboken, NJzbMATHGoogle Scholar
  15. 15.
    Elmagarmid AK, Ipeirotis PG, Verykios VS (2007) Duplicate record detection: a survey. IEEE Trans Knowl Data Eng 19(1):1–16CrossRefGoogle Scholar
  16. 16.
    Hassanzadeh O, Chiang F, Miller RJ, Lee HC (2009) Framework for evaluating clustering algorithms in duplicate detection. PVLDB 2(1):1282–1293Google Scholar
  17. 17.
    Hassanzadeh O, Miller RJ (2009) Creating probabilistic databases from duplicated data. VLDB J 18(5):1141–1166CrossRefGoogle Scholar
  18. 18.
    Jiang Y, Li G, Feng J, Li W (2014) String similarity joins: an experimental evaluation. PVLDB 7(8):625–636Google Scholar
  19. 19.
    Kalashnikov DV, Mehrotra S (2006) Domain-independent data cleaning via analysis of entity-relationship graph. ACM Trans Database Syst 31(2):716–767CrossRefGoogle Scholar
  20. 20.
    Kim HS, Lee D (2010) Harra: fast iterative hashed record linkage for large-scale data collections. In: EDBT, pp 525–536Google Scholar
  21. 21.
    Koudas N, Saha A, Srivastava D, Venkatasubramanian S (2009) Metric functional dependencies. In: ICDE, pp 1275–1278Google Scholar
  22. 22.
    Manning CD, Schütze H (1999) Foundations of statistical natural language processing. The MIT Press, Cambridge, MAzbMATHGoogle Scholar
  23. 23.
    Menestrina D, Whang S, Garcia-Molina H (2010) Evaluating entity resolution results. PVLDB 3(1):208–219Google Scholar
  24. 24.
    Micali S, Vazirani VV (1980) An \(\text{ o }(\sqrt{|V|}|e|\)) algorithm for finding maximum matching in general graphs. In: FOCS, pp 17–27Google Scholar
  25. 25.
    Michelson M, Knoblock CA (2006) Learning blocking schemes for record linkage. In: AAAIGoogle Scholar
  26. 26.
    Michelson M, Knoblock CA (2009) Mining the heterogeneous transformations between data sources to aid record linkage. In: IC-AI, pp 422–428Google Scholar
  27. 27.
    Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: KDD, pp 269–278Google Scholar
  28. 28.
    Schaeffer SE (2007) Graph clustering. Comput Sci Rev 1(1):27–64MathSciNetCrossRefzbMATHGoogle Scholar
  29. 29.
    Shen W, DeRose P, Vu L, Doan A, Ramakrishnan R (2007) Source-aware entity matching: a compositional approach. In: ICDE, pp 196–205Google Scholar
  30. 30.
    Shen W, Li X, Doan A (2005) Constraint-based entity matching. In: AAAI, pp 862–867Google Scholar
  31. 31.
    Whang SE, Menestrina D, Koutrika G, Theobald M, Garcia-Molina H (2009) Entity resolution with iterative blocking. In: SIGMOD conference, pp 219–232Google Scholar
  32. 32.
    Xiao C, Wang W, Lin X, Yu JX (2008) Efficient similarity joins for near duplicate detection. In: WWW, pp 131–140Google Scholar
  33. 33.
    Yang X, Wang B, Li C (2008) Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. In: SIGMOD conference, pp 353–364Google Scholar
  34. 34.
    Yannakakis M (1978) Node- and edge-deletion np-complete problems. In: STOC, pp 253–264Google Scholar
  35. 35.
    Yin X, Han J, Yu PS (2007) Object distinction: distinguishing objects with identical names. In: ICDE, pp 1242–1246Google Scholar

Copyright information

© Springer-Verlag London 2015

Authors and Affiliations

  1. 1.Department of Computer Science and TechnologyHarbin Institute of TechnologyHarbinChina

Personalised recommendations