Resolving Entity on A Large scale: DEtermining Linked Entities and Grouping similar Attributes represented in assorted TErminologies

Distributed and Parallel Databases

Abstract

The tremendous growth of the World Wide Web (WWW) exposes an abundance of unresolved real-world entities through public Web databases. Entity resolution (ER) is a vital prerequisite for leveraging Web entities, because it resolves the entity profiles that describe the same real-world objects. Data blocking is a popular method for grouping similar entity profiles without duplication. Existing ER techniques apply hierarchical blocking to ease dimensionality reduction, and canopy clustering is used as a pre-clustering method to increase processing speed. However, canopy clustering performs a pairwise comparison of the entities, which makes it computationally intensive. Moreover, conventional data-blocking techniques have limited control over both block size and block overlap, despite the significance of blocking quality in many potential applications. This paper proposes Real-Delegate (Resolving Entity on A Large scale: DEtermining Linked Entities and Grouping similar Attributes represented in assorted TErminologies), an approach that exploits attribute-based unsupervised hierarchical blocking as well as meta-blocking without relying on pre-clustering. The proposed approach significantly improves the efficiency of the blocking function in three phases. In the first phase, Real-Delegate links multiple sets of equivalent entity descriptions using Linked Open Data (LOD) to integrate multiple Web sources. The second phase employs attribute-based unsupervised hierarchical blocking with rough set theory (RST), which considerably reduces superfluous comparisons. In the final phase, Real-Delegate eliminates redundant entities by employing a graph-based meta-blocking model that represents redundancy-positive blocks and removes overlapping profiles effectively. The experimental results demonstrate that the proposed approach significantly improves the effectiveness of entity resolution compared with the token blocking method on a large-scale Web dataset.
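To make the terms in the abstract concrete, the following is a minimal sketch of the token blocking baseline followed by a generic graph-based meta-blocking step. It is not the Real-Delegate model: the edge weighting (number of shared blocks) and the mean-weight pruning rule are assumptions chosen for illustration, and the three profiles and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(profiles):
    """Map each token to the set of profile ids whose attribute values contain it."""
    blocks = defaultdict(set)
    for pid, attributes in profiles.items():
        for value in attributes.values():
            for token in value.lower().split():
                blocks[token].add(pid)
    # A block with a single profile yields no comparisons, so drop it.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def blocking_graph(blocks):
    """Weight each profile pair by the number of blocks the two profiles share.

    Assumed weighting scheme: in a redundancy-positive block collection,
    more shared blocks suggests a more likely match.
    """
    weights = defaultdict(int)
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            weights[(a, b)] += 1
    return dict(weights)


def prune_edges(weights):
    """Keep only the pairs whose weight is at least the mean edge weight (assumed rule)."""
    if not weights:
        return set()
    mean = sum(weights.values()) / len(weights)
    return {pair for pair, weight in weights.items() if weight >= mean}


# Hypothetical profiles that describe entities with different attribute names.
profiles = {
    "p1": {"name": "John Peel", "job": "radio presenter"},
    "p2": {"label": "Peel John", "occupation": "presenter"},
    "p3": {"name": "Jane Doe", "job": "radio engineer"},
}

blocks = token_blocking(profiles)
weights = blocking_graph(blocks)
candidates = prune_edges(weights)
print(sorted(candidates))
```

In this toy input, token blocking places p1 and p2 together in three blocks, so the same pair would be compared three times. The blocking graph collapses those repeats into one weighted edge, and pruning discards the weakly connected pair (p1, p3).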

Author information

Corresponding author

Correspondence to K. A. Vidhya.

About this article

Cite this article

Vidhya, K.A., Geetha, T.V. Resolving Entity on A Large scale: DEtermining Linked Entities and Grouping similar Attributes represented in assorted TErminologies. Distrib Parallel Databases 35, 303–332 (2017). https://doi.org/10.1007/s10619-017-7205-1

  • DOI: https://doi.org/10.1007/s10619-017-7205-1
