Abstract
Entity Resolution (ER) is a core data integration task that aims to match different representations of the same entities coming from various sources. Because its complexity is quadratic in the number of entities, ER typically scales to large datasets through approximate methods, i.e., blocking: similar entities are clustered into blocks and pairwise comparisons are executed only between entities that co-occur in a block, at the cost of some missed matches. In traditional settings, ER is part of the data integration process, i.e., a preprocessing step performed before “clean” data is made available for analysis. With the increasing demand for real-time analytical applications, recent research has begun to consider new approaches for integrating Entity Resolution with Query Processing. In this work, we explore the problem of query-driven Entity Resolution and propose a method for efficiently applying blocking and meta-blocking techniques during query processing. The aim of our approach is to answer SQL-like queries issued on top of dirty data effectively and efficiently. The experimental evaluation of the proposed solution demonstrates significant advantages over other techniques in the given problem settings.
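To illustrate the blocking idea described in the abstract, the following is a minimal sketch of token blocking, one common schema-agnostic blocking scheme. It is not the authors' implementation: the function names `token_blocking` and `candidate_pairs` and the toy records are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(records):
    """Build one block per lowercase token; each block holds the ids of
    the records whose text contains that token (hypothetical helper)."""
    blocks = defaultdict(set)
    for rid, text in records.items():
        for token in text.lower().split():
            blocks[token].add(rid)
    # A block with a single record yields no comparison, so drop it.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    """Collect the distinct pairs of records that co-occur in any block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Toy "dirty" records (assumed data, not taken from the paper).
records = {
    1: "John Smith Athens",
    2: "J. Smith Athens",
    3: "Maria Papadopoulou Patras",
    4: "Maria Papadopoulou",
}

blocks = token_blocking(records)
pairs = candidate_pairs(blocks)

# Exhaustive comparison would need n * (n - 1) / 2 pairs.
total_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {total_pairs}: {sorted(pairs)}")
```

In this toy case only the pairs that share at least one token are compared, while a true match whose representations share no token would be missed, which is the trade-off the abstract refers to.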
Notes
1. The source code and the queries are available at https://github.com/galexiou/isip2019.
Acknowledgements
This research is funded by the project VisualFacts (#1614) - 1st Call of the Hellenic Foundation for Research and Innovation Research Projects for the support of post-doctoral researchers.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Alexiou, G., Papastefanatos, G. (2020). Query Driven Entity Resolution in Data Lakes. In: Flouris, G., Laurent, D., Plexousakis, D., Spyratos, N., Tanaka, Y. (eds) Information Search, Integration, and Personalization. ISIP 2019. Communications in Computer and Information Science, vol 1197. Springer, Cham. https://doi.org/10.1007/978-3-030-44900-1_8
DOI: https://doi.org/10.1007/978-3-030-44900-1_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-44899-8
Online ISBN: 978-3-030-44900-1
eBook Packages: Computer Science, Computer Science (R0)