
Evaluating entity-description conflict on duplicated data

Journal of Combinatorial Optimization

Abstract

Duplicated records, which describe the same real-world entity, are frequently generated by data integration. Ideally, duplicated records should have identical values on the same attributes. In practice, however, they may have conflicting values due to ambiguity and data errors. The more conflicts there are among the duplicated records in a data set, the poorer the quality of the data set. To address this problem, we explore a new data quality measure, entity-description conflict, which evaluates the conflict among duplicated records. Because current entity resolution algorithms can hardly identify duplicated records correctly and completely, computing the entity-description conflict is challenging. This paper therefore studies how to compute the range of the entity-description conflict when the entity resolution result is not completely correct. (1) A mathematical model of the entity-description conflict is introduced. (2) Four primary operators for computing the range of the entity-description conflict are identified and proved to be NP-hard, which implies that computing the range of the entity-description conflict is NP-hard. (3) Four approximation algorithms for the four primary operators are provided, and a framework based on the four primary operators is proposed for computing the range of the entity-description conflict. (4) The effectiveness and efficiency of the proposed algorithms are verified experimentally on real-life and synthetic data.


Notes

  1. http://www.cs.utexas.edu/users/ml/riddle/data.html.


Acknowledgments

This paper was partially supported by NGFR 973 Grant 2012CB316200, NGFR 863 Grant 2012AA011004 and NSFC Grant 61472099.

Author information

Corresponding author

Correspondence to Lingli Li.

Appendix


Theorem 5 MaxDec-Comp is a polynomial-time 2-approximation algorithm.

Proof

From Proposition 1, MaxDec-Comp runs in polynomial time.

Suppose the solution obtained by MaxDec-Comp is \(c_1\), the optimal solution is \(c^*_1\), and the costs of \(c_1\) and \(c^*_1\) are denoted by \(cost({c_1})\) and \(cost({c^*_1})\) respectively such that \(cost({c_1})=edc(c)-edc(c\backslash {c_1})\) and \(cost({c^*_1})=edc(c)-edc(c\backslash {c^*_1})\). Now we prove \(2cost({c_1})\ge {cost({c^*_1})}\).

Because the records in \(c_1\) have the top \(b\) weights, we obtain

$$\begin{aligned} \sum _{r_i\in {c_1}}{w_i}\ge {\sum _{r_i\in {c^*_1}}{w_i}} \end{aligned}$$
(11)

By the definition of \(edc\) and the definition of weight \(w_i\) of each record \(r_i\), the following formula can be derived.

$$\begin{aligned} \nonumber \sum _{r_i\in {c_1}}{w_i}&= \sum _{r_i\in {c_1}}{\sum _{r_j\in {c}}dis(r_i,r_j)}\\ \nonumber&= {\sum _{r_i\in {c_1}}{\sum _{r_j\in {c_1}}dis(r_i,r_j)} +\sum _{r_i\in {c_1}}{\sum _{r_j\in {c\backslash {c_1}}}dis(r_i,r_j)}}\\ \nonumber&\le \sum _{r_i\in {c_1}}{\sum _{r_j\in {c_1}}dis(r_i,r_j)} +2\sum _{r_i\in {c_1}}{\sum _{r_j\in {c\backslash {c_1}}}dis(r_i,r_j)}\\ \nonumber&= \sum _{r_i\in {c}}{\sum _{r_j\in {c}}dis(r_i,r_j)} -\sum _{r_i\in {c\backslash {c_1}}}{\sum _{r_j\in {c\backslash {c_1}}}dis(r_i,r_j)}\\ \nonumber&= 2\left( \sum _{r_i,r_{j}\in {c},i\le {j}}{dis(r_i,r_{j})} -\sum _{r_i,r_{j}\in {c\backslash {c_1}},i\le {j}}{dis(r_i,r_{j})}\right) \\&= 2edc(c)-2edc(c\backslash {c_1})=2cost({c_1}) \end{aligned}$$
(12)

Similarly, we have

$$\begin{aligned} \nonumber \sum _{r_i\in {c^*_1}}{w_i}&= \sum _{r_i\in {c^*_1}}{\sum _{r_j\in {c}}dis(r_i,r_j)}\\ \nonumber&= \sum _{r_i\in {c^*_1}}{\sum _{r_j\in {c^*_1}}dis(r_i,r_j)} +\sum _{r_i\in {c^*_1}}{\sum _{r_j\in {c\backslash {c^*_1}}}dis(r_i,r_j)}\\ \nonumber&= 2\sum _{r_i,r_{j}\in {c^*_1},i\le {j}}{dis(r_i,r_j)} +\sum _{r_i\in {c^*_1}}{\sum _{r_j\in {c\backslash {c^*_1}}}dis(r_i,r_j)}\\ \nonumber&\ge {\sum _{r_i,r_{j}\in {c^*_1},i\le {j}}{dis(r_i,r_j)} +\sum _{r_i\in {c^*_1}}{\sum _{r_j\in {c\backslash {c^*_1}}}dis(r_i,r_j)}}\\&= {\sum _{r_i,r_{j}\in {c},i\le {j}}{dis(r_i,r_j)} -\sum _{r_i,r_{j}\in {c\backslash {c^*_1}},i\le {j}}{dis(r_i,r_j)}} =cost({c^*_1}) \end{aligned}$$
(13)

From inequalities (11), (12) and (13), it follows that

$$\begin{aligned} 2cost({c_1})\ge {\sum _{r_i\in {c_1}}{w_i}}\ge {\sum _{r_i\in {c^*_1}}{w_i}} \ge {cost({c^*_1})} \end{aligned}$$

This completes the proof of Theorem 5. \(\square \)
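The argument above can be checked numerically on a small instance. The following is a minimal sketch, not the paper's implementation: it assumes MaxDec-Comp simply selects the \(b\) records with the largest weights \(w_i=\sum _{r_j\in c}dis(r_i,r_j)\), as the proof describes, and it uses a Hamming distance over attribute tuples as a stand-in for \(dis\). The function names, the sample records, and the distance are all illustrative assumptions.

```python
from itertools import combinations

def hamming(r, s):
    """Assumed stand-in for dis: number of attributes on which r and s differ."""
    return sum(x != y for x, y in zip(r, s))

def edc(records, dis):
    """Sum of dis over all unordered pairs of records."""
    return sum(dis(r, s) for r, s in combinations(records, 2))

def cost(records, removed, dis):
    """cost(c_1) = edc(c) - edc(c without c_1); removed holds record indices."""
    kept = [r for i, r in enumerate(records) if i not in removed]
    return edc(records, dis) - edc(kept, dis)

def top_b_by_weight(records, b, dis):
    """Indices of the b records with the largest weights w_i = sum_j dis(r_i, r_j)."""
    weight = {i: sum(dis(r, s) for s in records) for i, r in enumerate(records)}
    return set(sorted(weight, key=weight.get, reverse=True)[:b])

# Hypothetical cluster of five duplicated records with three attributes each.
records = [
    ("a", "x", "1"),
    ("a", "x", "2"),
    ("b", "x", "1"),
    ("b", "y", "3"),
    ("a", "y", "1"),
]
b = 2

greedy = top_b_by_weight(records, b, hamming)
greedy_cost = cost(records, greedy, hamming)

# Brute-force optimum over all subsets of size b (feasible only for tiny inputs).
best_cost = max(
    cost(records, set(s), hamming)
    for s in combinations(range(len(records)), b)
)

print(greedy_cost, best_cost)
```

On any instance where the distance is symmetric, nonnegative, and zero on identical records, Theorem 5 implies that twice `greedy_cost` is at least `best_cost`.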

Before proving Theorem 7, the following lemma is proved.

Lemma 1

Given an instance of \(\mathsf {MaxInc^*}\), suppose the optimal solution contains \(\{x_i|1\le {i}\le {n}\}\) and \(\{y_{ij}|1\le {i}<j\le {n}\}\). \(\{x_i|1\le {i}\le {n}\}\) and \(\{y_{ij}|1\le {i}<j\le {n}\}\) satisfy that \(\forall {i,j}, i\ne {j}, y_{ij}=min\{x_i,x_j\}\).

Proof

From inequalities (7) and (8), it follows that

$$\begin{aligned} y_{ij}\le {min\{x_i,x_j\}}. \end{aligned}$$
(14)

From inequality (14), we have

$$\begin{aligned} y_{ij}\ge {x_i+x_j-1}. \end{aligned}$$
(15)

As \(0\le {x_j}\le {1}\) for \(1\le j\le n\), it implies that

$$\begin{aligned} {x_i+x_j-1}\le {min\{x_i,x_j\}} \end{aligned}$$
(16)

Combining inequalities (14)-(16) gives

$$\begin{aligned} {x_i+x_j-1}\le {y_{ij}}\le {min\{x_i,x_j\}}. \end{aligned}$$

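As \(\{y_{ij}|1\le {i}<j\le {n}\}\) is part of an optimal solution to \(\mathsf {MaxInc^*}\), \(\sum _{1\le {i}<j\le {n}}{c_{ij} y_{ij}}\) is maximized. Provided the weights \(c_{ij}\) are nonnegative, the objective does not decrease when any \(y_{ij}\) is raised to its upper bound. Thus \(y_{ij}=min\{x_i,x_j\}\). \(\square \)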
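As a quick sanity check of the bounds used in this lemma and in the proof of Theorem 7, the sketch below verifies on a grid of values in \([0,1]\) that \(x_i+x_j-1\le min\{x_i,x_j\}\le \frac{x_i+x_j}{2}\). The grid resolution and the floating-point tolerance are arbitrary choices made for illustration.

```python
from itertools import product

TOL = 1e-12  # tolerance for floating-point rounding

def sandwich_holds(xi, xj):
    """Check x_i + x_j - 1 <= min(x_i, x_j) <= (x_i + x_j) / 2."""
    lower = xi + xj - 1
    upper = min(xi, xj)
    mean = (xi + xj) / 2
    return lower <= upper + TOL and upper <= mean + TOL

grid = [k / 10 for k in range(11)]  # 0.0, 0.1, ..., 1.0
all_hold = all(sandwich_holds(xi, xj) for xi, xj in product(grid, repeat=2))
print(all_hold)
```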

Theorem 7 MaxInc-Comp is a polynomial-time \(\rho \)-approximation algorithm, where \(\rho =\frac{(n-1)b}{b-1}\).

Proof

We have already shown that MaxInc-Comp runs in polynomial time.

Now we prove that MaxInc-Comp is a \(\rho \)-approximation algorithm. Suppose the optimal solution to \(\mathsf {MaxInc^*}\) contains \(\{x_i|1\le {i}\le {n}\}\) and \(\{y_{ij}|1\le {i}<j\le {n}\}\). Then the cost of this optimal solution is \(cost^*=\sum _{1\le {i}<j\le {n}}{c_{ij} y_{ij}}\), which is an upper bound on the cost of the optimal solution to \(\mathsf {MaxInc}\), denoted by \(cost\). Suppose the solution obtained by MaxInc-Comp is \((U,F)\). The cost of this solution is \(\sum _{(i,j)\in {F}}c_{ij}\).

By Lemma 1, \(y_{ij}=min\{x_i,x_j\}\le {\frac{x_i+x_j}{2}}\) for all \(1\le {i}<j\le {n}\). Hence we have

$$\begin{aligned} {\sum _{1\le {i}<j\le {n}}{c_{ij} y_{ij}}}\le {\sum _{1\le {i}<j\le {n}}{c_{ij} {\frac{x_i+x_j}{2}}}} =\sum _{i}{x_i \frac{\sum _{j\ne {i}}{c_{ij}}}{2}} \end{aligned}$$
(17)

Let \(d_i={\frac{1}{2}} \sum _{j\ne {i}}{c_{ij}}\). Inequality (17) can be written as

$$\begin{aligned} \sum _{1\le {i}<j\le {n}}{c_{ij} y_{ij}}\le {\sum _{i}{x_id_i}} \end{aligned}$$
(18)

Suppose the values \(d_i\) are sorted in descending order, and the order is \(d_1,d_2,\ldots ,d_{n}\). As \(\sum _{i}{x_i}=b\) and \(0\le {x_i}\le {1}\), the following inequality can be easily proved by contradiction.

$$\begin{aligned} \sum _{i}{x_i d_i}\le {\sum _{1\le {i}\le {b}}d_i} \end{aligned}$$
(19)

The combination of inequalities (18) and (19) gives

$$\begin{aligned} \sum _{1\le {i}<j\le {n}}{c_{ij} y_{ij}}\le {\sum _{1\le {i}\le {b}}d_i} \end{aligned}$$
(20)

The algorithm sorts the edges in \(E\) in descending order of weight, denoted by \(e_{i_1j_1},e_{i_2j_2},\ldots ,e_{i_kj_k},\ldots \), then picks the edges with the largest weights and places their endpoints into \(U\) until \(|U|=b\). Let \(E^{\prime }=\{e_{i_1j_1},e_{i_2j_2},\ldots ,e_{i_mj_m}\}\) be the edges which are picked during the loop of the algorithm. Since \(E^{\prime }\subseteq {F}\) and \(m\ge {\lfloor \frac{b}{2}\rfloor }\ge {\frac{b-1}{2}}\), we have

$$\begin{aligned} \sum _{(i,j)\in {E^{\prime }}}{c_{ij}}&\ge {\sum _{(i,j)\in {E^{\prime \prime }}}{c_{ij}}} \end{aligned}$$
(21)
$$\begin{aligned} \frac{\sum _{(i,j)\in {E^{\prime \prime }}}c_{ij}}{\lfloor \frac{b}{2}\rfloor }&\ge {\frac{\sum _{(i,j)\in {E^{\prime \prime }}}c_{ij}}{m}}\ge {\frac{\sum _{1\le {i} \le {b}}d_i}{\frac{b(n-1)}{2}}} \end{aligned}$$
(22)

From inequality (22) and \(\lfloor \frac{b}{2}\rfloor \ge {\frac{b-1}{2}}\), we have

$$\begin{aligned} \sum _{(i,j)\in {E^{\prime \prime }}}{c_{ij}}\ge {\frac{\sum _{1\le {i}\le {b}}d_i}{\frac{b (n-1)}{b-1}}} \end{aligned}$$
(23)

Combining inequalities (20) and (23) gives

$$\begin{aligned} \sum _{(i,j)\in {E^{\prime \prime }}}{c_{ij}}\ge {\frac{\sum _{1\le {i}<j\le {n}}{c_{ij} y_{ij}}}{\frac{b (n-1)}{b-1}}} \end{aligned}$$
(24)

From inequalities (21) and (24), we have

$$\begin{aligned} \sum _{(i,j)\in {E^{\prime }}}{c_{ij}}\ge {\frac{\sum _{1\le {i}<j\le {n}}{c_{ij} y_{ij}}}{\frac{b (n-1)}{b-1}}} \end{aligned}$$
(25)

which implies that

$$\begin{aligned} \sum _{1\le {i}<j\le {n}}{c_{ij}y_{ij}} \le {\rho \sum _{(i,j)\in {F}}c_{ij}} \end{aligned}$$
(26)

where \(\rho =\frac{(n-1)b}{b-1}\). Hence MaxInc-Comp is a \(\rho \)-approximation algorithm. \(\square \)
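The greedy procedure analysed above can be sketched and compared with a brute-force optimum on a tiny instance. This is a hedged illustration, not the authors' MaxInc-Comp: it assumes a complete graph with nonnegative weights \(c_{ij}\), takes \(F\) to be the edges induced by \(U\), and, when only one free slot remains in \(U\), adds just the first endpoint of the next edge. That last rule, the function names, and the synthetic weights are assumptions made for this sketch.

```python
from itertools import combinations

def max_inc_greedy(weights, b):
    """Scan edges by descending weight, adding endpoints to U until |U| = b.

    weights maps an edge (i, j) with i < j to a nonnegative weight c_ij.
    Assumption: if only one slot is left, only the first endpoint is added.
    Returns (U, F), where F is the set of edges with both endpoints in U.
    """
    U = set()
    by_weight = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    for (i, j), _ in by_weight:
        for v in (i, j):
            if len(U) < b:
                U.add(v)
        if len(U) == b:
            break
    F = {(i, j) for (i, j) in weights if i in U and j in U}
    return U, F

def induced_cost(weights, U):
    """Total weight of the edges with both endpoints in U."""
    return sum(c for (i, j), c in weights.items() if i in U and j in U)

# Hypothetical complete graph on n = 5 vertices with synthetic integer weights.
n, b = 5, 3
weights = {(i, j): (3 * i + 5 * j) % 7 + 1 for i, j in combinations(range(n), 2)}

U, F = max_inc_greedy(weights, b)
greedy_cost = sum(weights[e] for e in F)

# Brute-force optimum over all vertex subsets of size b (tiny inputs only).
best_cost = max(induced_cost(weights, set(S)) for S in combinations(range(n), b))

rho = (n - 1) * b / (b - 1)
print(greedy_cost, best_cost, rho)
```

Under these assumptions the heaviest \(\lfloor \frac{b}{2}\rfloor \) edges always end up in \(F\), which is the property the proof relies on, so `rho * greedy_cost` is at least `best_cost`.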

Theorem 8 MinInc-Comp is a polynomial-time \(\rho \)-approximation algorithm, where \(\rho =\frac{(n-1)b}{b-1}\).

Proof

Obviously MinInc-Comp runs in polynomial time.

Now we prove that MinInc-Comp is a \(\rho \)-approximation algorithm. We denote the given \(\mathsf {MinInc}\) instance by \(I\). Suppose \(I\) is encoded into a \(\mathsf {MaxInc}\) instance denoted by \(I^{\prime }\), and the relaxation of \(I^{\prime }\) is denoted by \(I^{\prime \prime }\), which is a \(\mathsf {MaxInc^*}\) instance. Suppose the cost of the optimal solution to \(I^{\prime }\) is \(cost^{\prime }\) and the cost of the optimal solution to \(I^{\prime \prime }\) is \(cost^{\prime \prime }\). Applying Theorem 7, we have \(\frac{cost^{\prime \prime }}{cost^{\prime }}\le {\rho }\). Since \(cost^{\prime }=\sum _{(i,j)\in {F}}{(w-c_{ij})}\), the cost of the solution generated by MinInc-Comp, denoted by \(cost\), satisfies \(cost=\sum _{(i,j)\in {F}}{w}-cost^{\prime }\). Let \(M=\sum _{(i,j)\in {F}}{w}\). Then we have

$$\begin{aligned} cost=M-cost^{\prime }. \end{aligned}$$
(27)

Suppose the optimal solution to \(I^{\prime \prime }\) involves \(\{x_i|1\le {i}\le {n}\}\) and \(\{y_{ij}|1\le {i}<j\le {n}\}\). Since \(cost^{\prime \prime }=\sum _{1\le {i}<j\le {n}}{(w-c_{ij}) y_{ij}}\) is maximized, the cost \(cost^*=\sum _{1\le {i}<j\le {n}}{c_{ij} y_{ij}}\) of the optimal solution to \(I\) is minimized. Then we have

$$\begin{aligned} cost^*=\sum _{(i,j)\in {F}}{w}-\sum _{1\le {i}<j\le {n}}{(w-c_{ij}) y_{ij}}=M-cost^{\prime \prime }. \end{aligned}$$
(28)

Combining the equalities (27) and (28) gives

$$\begin{aligned} \frac{cost}{cost^*}=\frac{M-cost^{\prime }}{M-cost^{\prime \prime }}. \end{aligned}$$
(29)

By Theorem 7, we have

$$\begin{aligned} cost^{\prime \prime }\le {\rho cost^{\prime }} \end{aligned}$$
(30)

Since \(w=(\rho +1) \max \{c_{ij}\}\), the following formula can be obtained.

$$\begin{aligned} \nonumber M&= \sum _{(i,j)\in {F}}w\\ \nonumber&= (\rho +1) \sum _{(i,j)\in {F}}{\max \{c_{ij}\}}\\ \nonumber&\ge {(\rho +1) \sum _{(i,j)\in {F}}c_{ij}}\\&\ge {(\rho +1) cost^{\prime }}. \end{aligned}$$
(31)

From inequalities (30) and (31), it follows that

$$\begin{aligned} \frac{M-cost^{\prime }}{M-cost^{\prime \prime }}\le {\frac{M-cost^{\prime }}{M-\rho cost^{\prime }}}\le {\frac{(\rho +1) cost^{\prime }-cost^{\prime }}{(\rho +1) cost^{\prime }-\rho cost^{\prime }}}=\rho \end{aligned}$$

which implies that

$$\begin{aligned} \frac{M-cost^{\prime }}{M-cost^{\prime \prime }}\le {\rho }. \end{aligned}$$
(32)

Combining the inequalities (29) and (32) gives

$$\begin{aligned} \frac{cost}{cost^*}\le {\rho }. \end{aligned}$$
(33)

Hence MinInc-Comp is a \(\rho \)-approximation algorithm. \(\square \)
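The encoding used in this proof, which replaces each weight \(c_{ij}\) by \(w-c_{ij}\) with \(w=(\rho +1) \max \{c_{ij}\}\), can be illustrated on a small example. The sketch below is an assumption-laden illustration rather than the paper's construction: it assumes a complete graph, so that every vertex subset of size \(b\) induces the same number of edges and \(M\) is the same constant for every candidate solution. Function names and the synthetic weights are hypothetical.

```python
from itertools import combinations
from math import comb, isclose

def encode_min_as_max(weights, rho):
    """Replace each weight c_ij by w - c_ij, where w = (rho + 1) * max c_ij."""
    w = (rho + 1) * max(weights.values())
    return w, {e: w - c for e, c in weights.items()}

def induced_cost(weights, U):
    """Total weight of the edges with both endpoints in U (complete graph)."""
    return sum(weights[(i, j)] for i, j in combinations(sorted(U), 2))

# Hypothetical complete graph on n = 5 vertices with synthetic integer weights.
n, b = 5, 3
weights = {(i, j): (3 * i + 5 * j) % 7 + 1 for i, j in combinations(range(n), 2)}
rho = (n - 1) * b / (b - 1)

w, encoded = encode_min_as_max(weights, rho)
M = comb(b, 2) * w  # sum of w over the edges induced by any b vertices

subsets = [set(S) for S in combinations(range(n), b)]
min_cost = min(induced_cost(weights, U) for U in subsets)
max_encoded = max(induced_cost(encoded, U) for U in subsets)

# Minimizing the original cost is the same as maximizing the encoded cost.
print(min_cost, M - max_encoded)
```

Because \(M\) is constant over all subsets of size \(b\) in this setting, the minimum of the original cost equals \(M\) minus the maximum of the encoded cost, mirroring equalities (27) and (28).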


Cite this article

Li, L., Li, J. & Gao, H. Evaluating entity-description conflict on duplicated data. J Comb Optim 31, 918–941 (2016). https://doi.org/10.1007/s10878-014-9801-6


  • DOI: https://doi.org/10.1007/s10878-014-9801-6
