
Journal of Combinatorial Optimization, Volume 31, Issue 2, pp 918–941

Evaluating entity-description conflict on duplicated data

  • Lingli Li
  • Jianzhong Li
  • Hong Gao
Article

Abstract

Duplicated records, which describe the same real-world entity, are frequently generated by data integration. Ideally, duplicated records should have identical values on the same attributes. However, they may have conflicting values because of ambiguity and data errors. The more conflicts there are among the duplicated records in a data set, the poorer the quality of that data set. To quantify this problem, we explore a new data quality measure, entity-description conflict, which evaluates the conflict among duplicated records. Because current entity resolution algorithms can hardly identify duplicated records correctly and completely, computing the entity-description conflict is challenging. This paper therefore studies how to compute the range of the entity-description conflict when the entity resolution result is not completely correct. (1) A mathematical model of the entity-description conflict is introduced. (2) Four primary operators for computing the range of the entity-description conflict are identified and proved to be NP-hard; consequently, the problem of computing the range of the entity-description conflict is itself NP-hard. (3) Four approximation algorithms for the four primary operators are provided, and a framework based on these operators is proposed for computing the range of the entity-description conflict. (4) The effectiveness and efficiency of the proposed algorithms are verified experimentally on real-life and synthetic data.
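
As a rough illustration of why an uncertain entity resolution result yields a range rather than a single value, the sketch below uses a toy conflict measure: the fraction of within-cluster (record pair, attribute) comparisons whose values differ. This measure, the function name `conflict_ratio`, and the sample records are assumptions made for illustration only; they are not the mathematical model or the operators defined in the paper.

```python
from itertools import combinations

def conflict_ratio(records, clusters, attributes):
    """Toy measure (an assumption, not the paper's model): the fraction of
    within-cluster (record pair, attribute) comparisons whose values differ."""
    conflicts = 0
    comparisons = 0
    for cluster in clusters:
        for a, b in combinations(cluster, 2):
            for attr in attributes:
                comparisons += 1
                if records[a][attr] != records[b][attr]:
                    conflicts += 1
    return conflicts / comparisons if comparisons else 0.0

# Hypothetical sample data.
records = {
    "r1": {"name": "J. Smith", "city": "Boston"},
    "r2": {"name": "John Smith", "city": "Boston"},
    "r3": {"name": "John Smith", "city": "Austin"},
}
attributes = ["name", "city"]

# Two plausible entity resolution outputs for the same records.
strict = [["r1", "r2"], ["r3"]]   # r3 treated as a different entity
loose = [["r1", "r2", "r3"]]      # all three treated as duplicates

low = conflict_ratio(records, strict, attributes)
high = conflict_ratio(records, loose, attributes)
print(low, high)
```

Under this toy measure the two clusterings give different conflict values for the same data, which is the kind of spread the range computation is meant to bound.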

Keywords

Entity-description conflict; Evaluation; Data quality; Data integration

Notes

Acknowledgments

This paper was partially supported by NGFR 973 Grant 2012CB316200, NGFR 863 Grant 2012AA011004 and NSFC Grant 61472099.


Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. Department of Computer Science, Harbin Institute of Technology, Harbin, China
