Incremental entity resolution process over query results for data integration systems

  • Priscilla Kelly Machado Vieira
  • Bernadette Farias Lóscio
  • Ana Carolina Salgado

Abstract

Entity Resolution (ER) in data integration systems is the problem of identifying groups of tuples, from one or multiple data sources, that represent the same real-world entity. It is a crucial stage of data integration processes, which often need to integrate data at query time. The task becomes even more challenging when data sources are dynamic or when a large volume of data needs to be integrated. To deal with large volumes of data, new ER solutions have been proposed. One approach is to perform the ER process over query results rather than over the whole set of tuples being integrated. Additionally, the results of previous ER tasks can be reused to reduce the number of comparisons between pairs of tuples at query time. Similarly, indexing techniques can help identify equivalent tuples and further reduce the number of pairwise comparisons. In this context, this work proposes an incremental ER process over query results. Its contributions are the specification, the implementation, and the evaluation of the proposed incremental process. Based on our experiments, we conclude that incremental ER at query time is more efficient than traditional ER processes.
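
The three ideas named in the abstract (resolving only the tuples a query returns, reusing earlier ER results, and using an index to limit comparisons) can be illustrated with a minimal sketch. This is not the authors' process: the class name `IncrementalResolver`, the three-letter blocking key, the `difflib`-based similarity, and the 0.85 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


class IncrementalResolver:
    """Toy incremental ER that reuses match decisions across queries.

    Hypothetical illustration only; not the process proposed in the article.
    """

    def __init__(self, threshold=0.85):
        self.threshold = threshold  # assumed similarity cut-off
        self.cluster_of = {}        # tuple id -> cluster id (earlier ER results)
        self.blocks = {}            # blocking key -> {cluster id: representative name}
        self.comparisons = 0        # pairwise comparisons performed so far

    @staticmethod
    def _block_key(name):
        # Assumed blocking key: first three characters, lower-cased.
        return name.strip().lower()[:3]

    def _similar(self, a, b):
        self.comparisons += 1
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        return ratio >= self.threshold

    def resolve(self, query_result):
        """Group the (tuple_id, name) pairs of one query result by entity."""
        groups = {}
        for tid, name in query_result:
            cid = self.cluster_of.get(tid)  # reuse a previous decision if any
            if cid is None:
                # Compare only against representatives in the same block.
                block = self.blocks.setdefault(self._block_key(name), {})
                cid = next(
                    (c for c, rep in block.items() if self._similar(name, rep)),
                    None,
                )
                if cid is None:
                    cid = tid          # start a new cluster named after this tuple
                    block[cid] = name
                self.cluster_of[tid] = cid
            groups.setdefault(cid, []).append(tid)
        return groups
```

Calling `resolve` once per query result keeps the index and the cluster assignments between calls, so a tuple that reappears in a later query is grouped without any new comparison. The sketch compares each new tuple with one representative per cluster, which is a simplification; a real process would also need to handle cluster merges and updates to the sources.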

Keywords

Data integration, Entity resolution, Record linkage, Duplicate detection, Incremental entity resolution

Notes

Acknowledgements

The authors thank the Center of Informatics at the Federal University of Pernambuco, Brazil, for providing the infrastructure used to develop this research.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Federal Rural University of Pernambuco, Recife, Brazil
