Incremental Data Fusion Based on Provenance Information

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8000)

Abstract

Data fusion is the process of combining multiple representations of the same object, extracted from several external sources, into a single, clean representation. It is usually the last step of an integration process, executed after the schema matching and entity identification steps. More specifically, data fusion aims to resolve attribute value conflicts based on user-defined rules. Although several approaches to fusing data exist in the literature, few of them focus on optimizing the process when new versions of the sources become available. In this paper, we propose a model for incremental data fusion. Our approach is based on storing provenance information in the form of a sequence of operations. These operations reflect the last fusion rules applied to the imported data. By keeping both the original source value and the new fused data in the operations repository, we are able to reliably detect source value updates and propagate them to the fusion process, which reapplies previously defined rules whenever possible. This approach reduces the number of data items affected by source updates and minimizes the amount of manual user intervention in future fusion processes.
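
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' model, which this page does not show. Every name here (`Operation`, `RULES`, `fuse`, `refresh`) is hypothetical, and so are the two example rules. The condition for reapplying a rule, namely that the same set of sources still provides a value, is likewise an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# All names below are hypothetical; the abstract does not define an API.
Key = Tuple[str, str]                 # (object id, attribute name)
Rule = Callable[[List[str]], str]     # resolves conflicting values to one

# Two illustrative user-defined fusion rules (assumed, not from the paper).
RULES: Dict[str, Rule] = {
    "longest": lambda values: max(values, key=len),
    "first_source": lambda values: values[0],
}


@dataclass
class Operation:
    """Provenance record: the last fusion rule applied to one attribute."""
    source_values: Dict[str, str]     # source name -> value seen at fusion time
    rule_name: str
    fused_value: str


Repository = Dict[Key, Operation]


def fuse(repo: Repository, key: Key,
         source_values: Dict[str, str], rule_name: str) -> str:
    """Apply a rule and record both the source values and the fused result."""
    ordered = [source_values[s] for s in sorted(source_values)]
    fused = RULES[rule_name](ordered)
    repo[key] = Operation(dict(source_values), rule_name, fused)
    return fused


def refresh(repo: Repository, key: Key,
            new_values: Dict[str, str]) -> Tuple[Optional[str], str]:
    """Compare a new source version against the stored operation.

    Returns (fused value, status), where status is one of
    'unchanged', 'reapplied', or 'needs_user'.
    """
    op = repo.get(key)
    if op is None:
        return None, "needs_user"             # never fused before
    if op.source_values == new_values:
        return op.fused_value, "unchanged"    # no source update detected
    if set(op.source_values) == set(new_values):
        # Assumed condition: same sources, changed values -> reuse the rule.
        return fuse(repo, key, new_values, op.rule_name), "reapplied"
    return None, "needs_user"                 # sources changed; ask the user
```

In this sketch, only attributes whose stored source values differ from the new version are touched, and the user is consulted only when the recorded rule cannot be reused.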

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Hara, C.S., de Aguiar Ciferri, C.D., Ciferri, R.R. (2013). Incremental Data Fusion Based on Provenance Information. In: Tannen, V., Wong, L., Libkin, L., Fan, W., Tan, WC., Fourman, M. (eds) In Search of Elegance in the Theory and Practice of Computation. Lecture Notes in Computer Science, vol 8000. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41660-6_18

  • DOI: https://doi.org/10.1007/978-3-642-41660-6_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-41659-0

  • Online ISBN: 978-3-642-41660-6

  • eBook Packages: Computer Science, Computer Science (R0)
