Query Driven Entity Resolution in Data Lakes

  • Conference paper
Information Search, Integration, and Personalization (ISIP 2019)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1197)

Abstract

Entity Resolution (ER) is a core data integration task that aims at matching different representations of the same entities coming from various sources. Because of its quadratic complexity, ER typically scales to large datasets through approximate methods, i.e., blocking: similar entities are clustered into blocks and pairwise comparisons are executed only between entities that co-occur in a block, at the cost of some missed matches. In traditional settings, ER is part of the data integration process, i.e., a preprocessing step performed before “clean” data are made available for analysis. With the increasing demand for real-time analytical applications, recent research has begun to consider new approaches for integrating Entity Resolution with Query Processing. In this work, we explore the problem of query-driven Entity Resolution and propose a method for efficiently applying blocking and meta-blocking techniques during query processing. The aim of our approach is to answer SQL-like queries issued on top of dirty data effectively and efficiently. The experimental evaluation of the proposed solution demonstrates its significant advantages over other techniques in the given problem settings.
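
To illustrate the blocking idea described in the abstract, the following is a minimal sketch of token-based blocking, assuming each record is a plain string and every whitespace-separated token acts as a blocking key. The function names (`token_blocking`, `candidate_pairs`) and the toy records are hypothetical and are not taken from the paper or its repository; the paper's query-driven method and its meta-blocking step are not reproduced here.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(records):
    """Group record ids into blocks keyed by each lowercase token.

    Assumption: records is a dict mapping a record id to a text string.
    Blocks with a single record are dropped, since they yield no comparisons.
    """
    blocks = defaultdict(set)
    for rid, text in records.items():
        for token in text.lower().split():
            blocks[token].add(rid)
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    """Return the distinct record pairs that co-occur in at least one block."""
    pairs = set()
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


# Toy data, invented purely for illustration.
records = {
    1: "john smith athens",
    2: "jon smith athens",
    3: "maria papadopoulou patras",
    4: "maria papadopoulos patras",
}

blocks = token_blocking(records)
pairs = candidate_pairs(blocks)

# Exhaustive ER would compare every pair: n * (n - 1) / 2.
all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

In this toy case only records that share a token are compared, so the quadratic set of comparisons shrinks to the pairs inside each block. A pair of matching records that shares no token would be missed, which is the recall cost the abstract mentions.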

Notes

  1. The source code and the queries are available at https://github.com/galexiou/isip2019.

Acknowledgements

This research is funded by the project VisualFacts (#1614), under the 1st Call of the Hellenic Foundation for Research and Innovation Research Projects for the support of post-doctoral researchers.

Author information

Corresponding authors

Correspondence to Giorgos Alexiou or George Papastefanatos.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Alexiou, G., Papastefanatos, G. (2020). Query Driven Entity Resolution in Data Lakes. In: Flouris, G., Laurent, D., Plexousakis, D., Spyratos, N., Tanaka, Y. (eds) Information Search, Integration, and Personalization. ISIP 2019. Communications in Computer and Information Science, vol 1197. Springer, Cham. https://doi.org/10.1007/978-3-030-44900-1_8

  • DOI: https://doi.org/10.1007/978-3-030-44900-1_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-44899-8

  • Online ISBN: 978-3-030-44900-1

  • eBook Packages: Computer Science, Computer Science (R0)
