Resolving Entity on A Large scale: DEtermining Linked Entities and Grouping similar Attributes represented in assorted TErminologies

Vidhya, K. A.; Geetha, T. V.

doi:10.1007/s10619-017-7205-1

Published: 09 September 2017

Volume 35, pages 303–332, (2017)

Distributed and Parallel Databases

Abstract

The tremendous growth of the World Wide Web (WWW) exposes an abundance of unresolved real-world entities through public Web databases. Entity resolution (ER) is the vital prerequisite for leveraging and resolving Web entities that describe the same real-world objects. Data blocking is a popular method for addressing Web entities and grouping similar entity profiles without duplication. The existing ER techniques apply hierarchical blocking to ease dimensionality reduction. Canopy clustering is a pre-clustering method for increasing processing speed; however, it performs pairwise comparisons of the entities, which is computationally intensive. Moreover, conventional data-blocking techniques have limited control over both the block size and overlapping blocks, despite the significance of blocking quality in many potential applications. This paper proposes Real-Delegate (Resolving Entity on A Large scale: DEtermining Linked Entities and Grouping similar Attributes represented in assorted TErminologies), an approach that exploits attribute-based unsupervised hierarchical blocking as well as meta-blocking without relying on pre-clustering. The proposed approach significantly improves the efficiency of the blocking function in three phases. In the first phase, Real-Delegate links multiple sets of equivalent entity descriptions using Linked Open Data (LOD) to integrate multiple Web sources. The second phase employs attribute-based unsupervised hierarchical blocking with rough set theory (RST), which considerably reduces superfluous comparisons. In the final phase, Real-Delegate eliminates redundant entities by employing a graph-based meta-blocking model that represents redundancy-positive blocks and removes overlapping profiles effectively. The experimental results demonstrate that the proposed approach significantly improves the effectiveness of entity resolution compared with the token blocking method on a large-scale Web dataset.
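
To make the terms in the abstract concrete, the following is a minimal sketch of the token blocking baseline followed by a generic graph-based meta-blocking step. It is not the Real-Delegate implementation: the LOD linking and the RST-based hierarchical blocking are not reproduced, and the sample profiles, the function names, the edge weight (number of shared blocks), and the pruning rule (keep edges at or above the mean weight) are all assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(profiles):
    """Put every profile into one block per token found in its attribute values."""
    blocks = defaultdict(set)
    for pid, attributes in profiles.items():
        for value in attributes.values():
            for token in str(value).lower().split():
                blocks[token].add(pid)
    # A block holding a single profile yields no comparison, so drop it.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def blocking_graph(blocks):
    """Weight each profile pair by the number of blocks the two profiles share."""
    weights = defaultdict(int)
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            weights[(a, b)] += 1
    return dict(weights)


def prune_edges(weights):
    """Keep only the pairs whose weight reaches the mean edge weight (assumed rule)."""
    if not weights:
        return set()
    mean = sum(weights.values()) / len(weights)
    return {pair for pair, w in weights.items() if w >= mean}


# Hypothetical entity profiles; attribute names differ across sources.
profiles = {
    "p1": {"name": "John Peel", "job": "radio DJ"},
    "p2": {"label": "Peel John", "occupation": "DJ radio presenter"},
    "p3": {"name": "John Smith", "job": "teacher"},
    "p4": {"title": "Radio Times"},
}

blocks = token_blocking(profiles)
graph = blocking_graph(blocks)
candidates = prune_edges(graph)
print(len(graph), "pairs before pruning,", len(candidates), "after")
```

Because token blocking ignores attribute names, the overlapping blocks it produces are redundancy-positive: the more blocks two profiles share, the more likely they match. In this toy input, the pair that shares four blocks survives pruning, while pairs that share a single common token are discarded before any detailed comparison.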

Author information

Authors and Affiliations

Department of Computer Science Engineering, Anna University, Chennai, India
K. A. Vidhya & T. V. Geetha

Corresponding author

Correspondence to K. A. Vidhya.

About this article

Cite this article

Vidhya, K.A., Geetha, T.V. Resolving Entity on A Large scale: DEtermining Linked Entities and Grouping similar Attributes represented in assorted TErminologies. Distrib Parallel Databases 35, 303–332 (2017). https://doi.org/10.1007/s10619-017-7205-1

Published: 09 September 2017
Issue Date: December 2017
DOI: https://doi.org/10.1007/s10619-017-7205-1
