Evaluating entity-description conflict on duplicated data

Li, Lingli; Li, Jianzhong; Gao, Hong

doi:10.1007/s10878-014-9801-6


Published: 14 October 2014

Volume 31, pages 918–941, (2016)

Journal of Combinatorial Optimization

Lingli Li¹,
Jianzhong Li¹ &
Hong Gao¹


Abstract

Duplicated records, which describe the same real-world entity, are frequently generated by data integration. Ideally, duplicated records should have identical values on the same attributes. However, they may have conflicting values on the same attributes due to ambiguity and data errors. The more conflicts there are among the duplicated records in a data set, the poorer the quality of the data set. To address this problem, we explore a new data quality measure, entity-description conflict, which evaluates the conflict among duplicated records. Because current entity resolution algorithms can hardly identify duplicated records correctly and completely, computing the entity-description conflict is challenging. This paper therefore studies how to compute the range of the entity-description conflict when the entity resolution result is not completely correct. (1) A mathematical model of the entity-description conflict is introduced. (2) Four primary operators for computing the range of the entity-description conflict are identified and proved to be NP-hard, and thus the problem of computing the range of the entity-description conflict is proved to be NP-hard. (3) Four approximation algorithms for the four primary operators are provided, and a framework based on the four primary operators is proposed for computing the range of the entity-description conflict. (4) The effectiveness and efficiency of the proposed algorithms are verified experimentally on real-life and synthetic data.


Notes

http://www.cs.utexas.edu/users/ml/riddle/data.html.


Acknowledgments

This paper was partially supported by NGFR 973 Grant 2012CB316200, NGFR 863 Grant 2012AA011004 and NSFC Grant 61472099.

Author information

Authors and Affiliations

Department of Computer Science, Harbin Institute of Technology, Harbin, China
Lingli Li, Jianzhong Li & Hong Gao


Corresponding author

Correspondence to Lingli Li.

Appendix

Theorem 5 MaxDec-Comp is a polynomial-time 2-approximation algorithm.

Proof

From Proposition 1, MaxDec-Comp runs in polynomial time.

Suppose the solution obtained by MaxDec-Comp is $c_1$, the optimal solution is $c^*_1$, and the costs of $c_1$ and $c^*_1$ are denoted by $cost({c_1})$ and $cost({c^*_1})$ respectively such that $cost({c_1})=edc(c)-edc(c\backslash {c_1})$ and $cost({c^*_1})=edc(c)-edc(c\backslash {c^*_1})$. Now we prove $2cost({c_1})\ge {cost({c^*_1})}$.

Because the records in $c_1$ have the top $b$ weights, we obtain

$$\begin{aligned} \sum _{r_i\in {c_1}}{w_i}\ge {\sum _{r_i\in {c^*_1}}{w_i}} \end{aligned}$$

(11)

By the definition of $edc$ and the definition of weight $w_i$ of each record $r_i$, the following formula can be derived.

$$\begin{aligned} \nonumber \sum _{r_i\in {c_1}}{w_i}&= \sum _{r_i\in {c_1}}{\sum _{r_j\in {c}}dis(r_i,r_j)}\\ \nonumber&= {\sum _{r_i\in {c_1}}{\sum _{r_j\in {c_1}}dis(r_i,r_j)} +\sum _{r_i\in {c_1}}{\sum _{r_j\in {c\backslash {c_1}}}dis(r_i,r_j)}}\\ \nonumber&\le \sum _{r_i\in {c_1}}{\sum _{r_j\in {c_1}}dis(r_i,r_j)} +2\sum _{r_i\in {c_1}}{\sum _{r_j\in {c\backslash {c_1}}}dis(r_i,r_j)}\\ \nonumber&= \sum _{r_i\in {c}}{\sum _{r_j\in {c}}dis(r_i,r_j)} -\sum _{r_i\in {c\backslash {c_1}}}{\sum _{r_j\in {c\backslash {c_1}}}dis(r_i,r_j)}\\ \nonumber&= 2\left( \sum _{r_i,r_{j}\in {c},i\le {j}}{dis(r_i,r_{j})} -\sum _{r_i,r_{j}\in {c\backslash {c_1}},i\le {j}}{dis(r_i,r_{j})}\right) \\&= 2edc(c)-2edc(c\backslash {c_1})=2cost({c_1}) \end{aligned}$$

(12)

Similarly, we have

$$\begin{aligned} \nonumber \sum _{r_i\in {c^*_1}}{w_i}&= \sum _{r_i\in {c^*_1}}{\sum _{r_j\in {c}}dis(r_i,r_j)}\\ \nonumber&= \sum _{r_i\in {c^*_1}}{\sum _{r_j\in {c^*_1}}dis(r_i,r_j)} +\sum _{r_i\in {c^*_1}}{\sum _{r_j\in {c\backslash {c^*_1}}}dis(r_i,r_j)}\\ \nonumber&= 2\sum _{r_i,r_{j}\in {c^*_1},i\le {j}}{dis(r_i,r_j)} +\sum _{r_i\in {c^*_1}}{\sum _{r_j\in {c\backslash {c^*_1}}}dis(r_i,r_j)}\\ \nonumber&\ge {\sum _{r_i,r_{j}\in {c^*_1},i\le {j}}{dis(r_i,r_j)} +\sum _{r_i\in {c^*_1}}{\sum _{r_j\in {c\backslash {c^*_1}}}dis(r_i,r_j)}}\\&= {\sum _{r_i,r_{j}\in {c},i\le {j}}{dis(r_i,r_j)} -\sum _{r_i,r_{j}\in {c\backslash {c^*_1}},i\le {j}}{dis(r_i,r_j)}} =edc(c)-edc(c\backslash {c^*_1})=cost({c^*_1}) \end{aligned}$$

(13)

From inequalities (11), (12) and (13), it follows that

$$\begin{aligned} 2cost({c_1})\ge {\sum _{r_i\in {c_1}}{w_i}}\ge {\sum _{r_i\in {c^*_1}}{w_i}} \ge {cost({c^*_1})} \end{aligned}$$

This completes the proof of Theorem 5. $\square $
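The bound of Theorem 5 can be checked numerically on a small cluster. The sketch below assumes, following (12), that $edc(c)$ is the sum of pairwise distances within $c$ and that $w_i=\sum _{r_j\in {c}}dis(r_i,r_j)$. The function names, the toy distance matrix, and the brute-force baseline are illustrative assumptions and are not taken from the paper.

```python
from itertools import combinations

def edc(dist, records):
    """Sum of pairwise distances among the given record indices."""
    return sum(dist[i][j] for i, j in combinations(sorted(records), 2))

def max_dec_top_weights(dist, b):
    """Pick the b records with the largest weights w_i = sum_j dist[i][j]."""
    n = len(dist)
    weights = [sum(dist[i]) for i in range(n)]
    ranked = sorted(range(n), key=lambda i: weights[i], reverse=True)
    return set(ranked[:b])

def cost(dist, removed):
    """Decrease in edc when `removed` is deleted from the full cluster."""
    everything = set(range(len(dist)))
    return edc(dist, everything) - edc(dist, everything - removed)

def brute_force_best(dist, b):
    """Exhaustive optimum; only feasible for tiny clusters."""
    n = len(dist)
    return max(cost(dist, set(s)) for s in combinations(range(n), b))

# Hypothetical symmetric distances with a zero diagonal.
DIST = [
    [0.0, 0.2, 0.9, 0.4],
    [0.2, 0.0, 0.7, 0.1],
    [0.9, 0.7, 0.0, 0.8],
    [0.4, 0.1, 0.8, 0.0],
]

chosen = max_dec_top_weights(DIST, b=2)
greedy_cost = cost(DIST, chosen)
optimal_cost = brute_force_best(DIST, b=2)
print(chosen, greedy_cost, optimal_cost)
```

On this toy input the top-weight selection happens to coincide with the exhaustive optimum; the theorem only guarantees that twice the greedy cost is at least the optimal cost.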

Before proving Theorem 7, the following lemma is proved.

Lemma 1

Given an instance of $\mathsf {MaxInc^*}$, suppose the optimal solution consists of $\{x_i|1\le {i}\le {n}\}$ and $\{y_{ij}|1\le {i}<j\le {n}\}$. Then $y_{ij}=\min \{x_i,x_j\}$ for all $i,j$ with $i\ne {j}$.

Proof

From inequalities (7) and (8), it follows that

$$\begin{aligned} y_{ij}\le {\min \{x_i,x_j\}}. \end{aligned}$$

(14)

From inequality (14), we have

$$\begin{aligned} y_{ij}\ge {x_i+x_j-1}. \end{aligned}$$

(15)

As $0\le {x_j}\le {1}$ for $1\le j\le n$, it implies that

$$\begin{aligned} {x_i+x_j-1}\le {\min \{x_i,x_j\}} \end{aligned}$$

(16)

Combining inequalities (14)-(16) gives

$$\begin{aligned} {x_i+x_j-1}\le {y_{ij}}\le {\min \{x_i,x_j\}}. \end{aligned}$$

As $\{y_{ij}|1\le {i}<j\le {n}\}$ is part of the optimal solution for $\mathsf {MaxInc^*}$, $\sum _{1\le {i}<j\le {n}}{c_{ij} y_{ij}}$ is maximized, so each $y_{ij}$ attains its upper bound. Thus $y_{ij}=\min \{x_i,x_j\}$. $\square $
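Inequality (16) is stated without derivation in the proof above. One way to see it, using only $0\le {x_i},{x_j}\le {1}$, is the following short calculation (a sketch, not the authors' wording):

$$\begin{aligned} x_i+x_j-1&= x_i-(1-x_j)\le {x_i},\\ x_i+x_j-1&= x_j-(1-x_i)\le {x_j},\\ \text {hence}\quad x_i+x_j-1&\le {\min \{x_i,x_j\}}. \end{aligned}$$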

Theorem 7 MaxInc-Comp is a polynomial-time $\rho $-approximation algorithm, where $\rho =\frac{(n-1)b}{b-1}$.

Proof

We have already shown that MaxInc-Comp runs in polynomial time.

Now we prove that MaxInc-Comp is a $\rho $-approximation algorithm. Suppose the optimal solution to $\mathsf {MaxInc^*}$ consists of $\{x_i|1\le {i}\le {n}\}$ and $\{y_{ij}|1\le {i}<j\le {n}\}$. Then the cost of this optimal solution is $cost^*=\sum _{1\le {i}<j\le {n}}{c_{ij} y_{ij}}$, which is an upper bound on the cost of the optimal solution to $\mathsf {MaxInc}$, denoted by $cost$. Suppose the solution obtained by MaxInc-Comp is $(U,F)$. The cost of this solution is $\sum _{(i,j)\in {F}}c_{ij}$.

By Lemma 1, $y_{ij}=\min \{x_i,x_j\}\le {\frac{x_i+x_j}{2}}$ for all $1\le {i}<j\le {n}$, so we have

$$\begin{aligned} {\sum _{1\le {i}<j\le {n}}{c_{ij} y_{ij}}}\le {\sum _{1\le {i}<j\le {n}}{c_{ij} {\frac{x_i+x_j}{2}}}} =\sum _{i}{x_i \frac{\sum _{j\ne {i}}{c_{ij}}}{2}} \end{aligned}$$

(17)

Let $d_i={\frac{1}{2}} \sum _{j\ne {i}}{c_{ij}}$. Inequality (17) can be written as

$$\begin{aligned} \sum _{1\le {i}<j\le {n}}{c_{ij} y_{ij}}\le {\sum _{i}{x_id_i}} \end{aligned}$$

(18)

Suppose the $d_i$ are sorted in descending order, so that $d_1\ge {d_2}\ge \cdots \ge {d_n}$. As $\sum _{i}{x_i}=b$, the following inequality can be easily proved by contradiction.

$$\begin{aligned} \sum _{i}{x_i d_i}\le {\sum _{1\le {i}\le {b}}d_i} \end{aligned}$$

(19)
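Inequality (19) can also be verified directly rather than by contradiction. The calculation below is a sketch that assumes $b$ is an integer, $0\le {x_i}\le {1}$, $\sum _{i}{x_i}=b$, and $d_1\ge {d_2}\ge \cdots \ge {d_n}$:

$$\begin{aligned} \sum _{i}{x_i d_i}-\sum _{1\le {i}\le {b}}{d_i}&= \sum _{i>b}{x_i d_i}-\sum _{1\le {i}\le {b}}{(1-x_i) d_i}\\&\le {d_b\sum _{i>b}{x_i}-d_b\sum _{1\le {i}\le {b}}{(1-x_i)}}\\&= d_b\left( \sum _{i}{x_i}-b\right) =0. \end{aligned}$$

The middle step uses $d_i\le {d_b}$ for $i>b$ and $d_i\ge {d_b}$ for $i\le {b}$, together with $x_i\ge {0}$ and $1-x_i\ge {0}$.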

The combination of inequalities (18) and (19) gives

$$\begin{aligned} \sum _{1\le {i}<j\le {n}}{c_{ij} y_{ij}}\le {\sum _{1\le {i}\le {b}}d_i} \end{aligned}$$

(20)

The algorithm sorts the edges in $E$ in descending order of weight, denoted by $e_{i_1j_1},e_{i_2j_2},\ldots ,e_{i_kj_k},\ldots $, and repeatedly picks the edge with the largest remaining weight and places its endpoints into $U$ until $|U|=b$. Let $E^{\prime }=\{e_{i_1j_1},e_{i_2j_2},\ldots ,e_{i_mj_m}\}$ be the edges picked during the loop of the algorithm. Since $E^{\prime }\subseteq {F}$ and $m\ge {\lfloor \frac{b}{2}\rfloor }\ge {\frac{b-1}{2}}$, we have

$$\begin{aligned} \sum _{(i,j)\in {E^{\prime }}}{c_{ij}}&\ge {\sum _{(i,j)\in {E^{\prime \prime }}}{c_{ij}}} \end{aligned}$$

(21)

$$\begin{aligned} \frac{\sum _{(i,j)\in {E^{\prime \prime }}}c_{ij}}{\lfloor \frac{b}{2}\rfloor }&\ge {\frac{\sum _{(i,j)\in {E^{\prime \prime }}}c_{ij}}{m}}\ge {\frac{\sum _{1\le {i} \le {b}}d_i}{\frac{b(n-1)}{2}}} \end{aligned}$$

(22)

From inequality (22), we have

$$\begin{aligned} \sum _{(i,j)\in {E^{\prime \prime }}}{c_{ij}}\ge {\frac{\sum _{1\le {i}\le {b}}d_i}{\frac{b (n-1)}{b-1}}} \end{aligned}$$

(23)

Combining inequalities (20) and (23) gives

$$\begin{aligned} \sum _{(i,j)\in {E^{\prime \prime }}}{c_{ij}}\ge {\frac{\sum _{1\le {i}<j\le {n}}{c_{ij} y_{ij}}}{\frac{b (n-1)}{b-1}}} \end{aligned}$$

(24)

From inequalities (21) and (24), we have

$$\begin{aligned} \sum _{(i,j)\in {E^{\prime }}}{c_{ij}}\ge {\frac{\sum _{1\le {i}<j\le {n}}{c_{ij} y_{ij}}}{\frac{b (n-1)}{b-1}}} \end{aligned}$$

(25)

which implies that

$$\begin{aligned} \sum _{1\le {i}<j\le {n}}{c_{ij}y_{ij}} \le {\rho \sum _{(i,j)\in {F}}c_{ij}} \end{aligned}$$

(26)

where $\rho =\frac{(n-1)b}{b-1}$. Hence MaxInc-Comp is a $\rho $-approximation algorithm. $\square $
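The greedy step described in the proof (scan edges by descending weight and add their endpoints to $U$ until $|U|=b$) can be sketched as follows. This is an illustration under assumptions, not the authors' MaxInc-Comp: it assumes $F$ is the set of edges induced by $U$, it skips an edge whose endpoints would overflow $U$, and it fills any leftover slot with unused nodes in index order. All function names and the weight matrix are hypothetical.

```python
from itertools import combinations

def induced_weight(c, nodes):
    """Total weight of the edges with both endpoints in `nodes`."""
    return sum(c[i][j] for i, j in combinations(sorted(nodes), 2))

def greedy_max_inc(c, b):
    """Scan edges by descending weight, adding endpoints until |U| = b."""
    n = len(c)
    edges = sorted(combinations(range(n), 2),
                   key=lambda e: c[e[0]][e[1]], reverse=True)
    chosen = set()
    for i, j in edges:
        if len(chosen) >= b:
            break
        new_nodes = {i, j} - chosen
        # Assumption: an edge that would push |U| past b is skipped.
        if len(chosen) + len(new_nodes) <= b:
            chosen |= new_nodes
    # Assumption: fill any remaining slot with unused nodes in index order.
    for v in range(n):
        if len(chosen) >= b:
            break
        chosen.add(v)
    return chosen

def brute_force_max_inc(c, b):
    """Exhaustive optimum over all b-subsets; tiny inputs only."""
    return max(induced_weight(c, s) for s in combinations(range(len(c)), b))

# Hypothetical symmetric edge weights c_ij.
C = [
    [0, 5, 1, 1, 4],
    [5, 0, 1, 1, 4],
    [1, 1, 0, 6, 1],
    [1, 1, 6, 0, 1],
    [4, 4, 1, 1, 0],
]
B = 3
U = greedy_max_inc(C, B)
greedy_value = induced_weight(C, U)
best_value = brute_force_max_inc(C, B)
rho = (len(C) - 1) * B / (B - 1)
print(U, greedy_value, best_value, rho)
```

Here the greedy choice commits to the heaviest edge first and ends up below the exhaustive optimum, while remaining within the factor $\rho $ on this instance.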

Theorem 8 MinInc-Comp is a polynomial-time $\rho $-approximation algorithm, where $\rho =\frac{(n-1)b}{b-1}$.

Proof

Obviously MinInc-Comp runs in polynomial time.

Now we prove that MinInc-Comp is a $\rho $-approximation algorithm. We denote the given $\mathsf {MinInc}$ instance as $I$. Suppose $I$ is encoded into a $\mathsf {MaxInc}$ instance denoted by $I^{\prime }$, and the relaxation IP instance of $I^{\prime }$ is denoted by $I^{\prime \prime }$, which is a $\mathsf {MaxInc^*}$ problem. Suppose the cost of the optimal solution to $I^{\prime }$ is $cost^{\prime }$ and the cost of the optimal solution to $I^{\prime \prime }$ is $cost^{\prime \prime }$. Applying Theorem 7, we have $\frac{cost^{\prime \prime }}{cost^{\prime }}\le {\rho }$. Since $cost^{\prime }=\sum _{(i,j)\in {F}}{(w-c_{ij})}$, the cost of the solution generated by MinInc-Comp, denoted by $cost$, satisfies $cost=\sum _{(i,j)\in {F}}{w}-cost^{\prime }$. Let $M=\sum _{(i,j)\in {F}}{w}$. Then we have

$$\begin{aligned} cost=M-cost^{\prime }. \end{aligned}$$

(27)

Suppose the optimal solution to $I^{\prime \prime }$ involves $\{x_i|1\le {i}\le {n}\}$ and $\{y_{ij}|1\le {i}<j\le {n}\}$. Since $cost^{\prime \prime }=\sum _{1\le {i}<j\le {n}}{(w-c_{ij}) y_{ij}}$ is maximized, the cost $cost^*$ of the optimal solution to $I$, given by $cost^*=\sum _{1\le {i}<j\le {n}}{c_{ij} y_{ij}}$, is minimized. Then we have

$$\begin{aligned} cost^*=\sum _{(i,j)\in {F}}{w}-\sum _{1\le {i}<j\le {n}}{(w-c_{ij}) y_{ij}}=M-cost^{\prime \prime }. \end{aligned}$$

(28)

Combining the equalities (27) and (28) gives

$$\begin{aligned} \frac{cost}{cost^*}=\frac{M-cost^{\prime }}{M-cost^{\prime \prime }}. \end{aligned}$$

(29)

By Theorem 7, we have

$$\begin{aligned} cost^{\prime \prime }\le {\rho cost^{\prime }} \end{aligned}$$

(30)

Since $w=(\rho +1) \max \{c_{ij}\}$, the following formula can be obtained.

$$\begin{aligned} \nonumber M&= \sum _{(i,j)\in {F}}w\\ \nonumber&= (\rho +1) \sum _{(i,j)\in {F}}{\max \{c_{ij}\}}\\ \nonumber&\ge {(\rho +1) \sum _{(i,j)\in {F}}c_{ij}}\\&\ge {(\rho +1) cost^{\prime }}. \end{aligned}$$

(31)

From inequalities (30) and (31), it follows that

$$\begin{aligned} \frac{M-cost^{\prime }}{M-cost^{\prime \prime }}\le {\frac{M-cost^{\prime }}{M-\rho cost^{\prime }}}\le {\frac{(\rho +1) cost^{\prime }-cost^{\prime }}{(\rho +1) cost^{\prime }-\rho cost^{\prime }}}=\rho \end{aligned}$$

which implies that

$$\begin{aligned} \frac{M-cost^{\prime }}{M-cost^{\prime \prime }}\le {\rho }. \end{aligned}$$

(32)

Combining the inequalities (29) and (32) gives

$$\begin{aligned} \frac{cost}{cost^*}\le {\rho }. \end{aligned}$$

(33)

Hence MinInc-Comp is a $\rho $-approximation algorithm. $\square $
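The encoding used in this proof replaces each weight $c_{ij}$ by $w-c_{ij}$ with $w=(\rho +1) \max \{c_{ij}\}$, so that minimizing the original cost corresponds to maximizing the encoded cost, as in (27). The sketch below illustrates this on a toy instance. It assumes $F$ is the set of edges induced by a $b$-node subset (so $|F|=b(b-1)/2$ and $M$ is a constant), and it uses exhaustive search in place of MaxInc-Comp; the function names and weights are hypothetical.

```python
from itertools import combinations

def induced_weight(c, nodes):
    """Total weight of the edges with both endpoints in `nodes`."""
    return sum(c[i][j] for i, j in combinations(sorted(nodes), 2))

def encode_min_as_max(c, b):
    """Return the encoded weights w - c_ij and the constant w."""
    n = len(c)
    rho = (n - 1) * b / (b - 1)
    w = (rho + 1) * max(c[i][j] for i, j in combinations(range(n), 2))
    encoded = [[0 if i == j else w - c[i][j] for j in range(n)]
               for i in range(n)]
    return encoded, w

# Hypothetical symmetric edge weights c_ij.
C = [
    [0, 5, 1, 2],
    [5, 0, 3, 4],
    [1, 3, 0, 7],
    [2, 4, 7, 0],
]
B = 3
encoded, w = encode_min_as_max(C, B)
M = w * B * (B - 1) / 2  # sum of w over the b(b-1)/2 induced edges

subsets = list(combinations(range(len(C)), B))
min_subset = min(subsets, key=lambda s: induced_weight(C, s))
max_subset = max(subsets, key=lambda s: induced_weight(encoded, s))
print(min_subset, max_subset, M)
```

Because $M$ does not depend on which $b$ nodes are chosen under this assumption, the subset that minimizes the original cost is the one that maximizes the encoded cost.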


About this article

Cite this article

Li, L., Li, J. & Gao, H. Evaluating entity-description conflict on duplicated data. J Comb Optim 31, 918–941 (2016). https://doi.org/10.1007/s10878-014-9801-6


Published: 14 October 2014
Issue Date: February 2016
DOI: https://doi.org/10.1007/s10878-014-9801-6
