Incremental Data Fusion Based on Provenance Information

Hara, Carmem Satie; de Aguiar Ciferri, Cristina Dutra; Ciferri, Ricardo Rodrigues

doi:10.1007/978-3-642-41660-6_18

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8000)

Abstract

Data fusion is the process of combining multiple representations of the same object, extracted from several external sources, into a single, clean representation. It is usually the last step of an integration process, executed after the schema matching and entity identification steps. More specifically, data fusion aims to resolve attribute value conflicts based on user-defined rules. Although the literature offers several approaches to fusing data, few of them focus on optimizing the process when new versions of the sources become available. In this paper, we propose a model for incremental data fusion. Our approach stores provenance information as a sequence of operations that reflect the last fusion rules applied to the imported data. By keeping both the original source value and the new fused data in the operations repository, we can reliably detect source value updates and propagate them to the fusion process, which reapplies previously defined rules whenever possible. This approach reduces the number of data items affected by source updates and minimizes manual user intervention in future fusion processes.
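The idea of an operations repository can be illustrated with a minimal Python sketch. It is not the model defined in the chapter: the `Operation` record, the `RULES` table, the rule names, and the `fuse` and `refresh` helpers are all hypothetical, chosen only to show how keeping the original source values next to the fused value makes it possible to detect source updates and reapply a stored rule to the affected items alone.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Key = Tuple[str, str]  # (object id, attribute name); an assumed addressing scheme

# Hypothetical fusion rules: each maps the conflicting source values to one value.
RULES: Dict[str, Callable[[List[str]], str]] = {
    "longest": lambda values: max(values, key=len),
    "first": lambda values: values[0],
}


@dataclass
class Operation:
    """One provenance record: the last fusion rule applied to an attribute."""
    key: Key
    rule: str                      # name of the rule in RULES
    source_values: Dict[str, str]  # source name -> value seen at fusion time
    fused_value: str               # value the rule produced


def fuse(key: Key, rule: str, source_values: Dict[str, str]) -> Operation:
    """Apply a rule and record both the inputs and the result."""
    ordered = [source_values[s] for s in sorted(source_values)]
    return Operation(key, rule, dict(source_values), RULES[rule](ordered))


def refresh(repository: List[Operation],
            new_versions: Dict[Key, Dict[str, str]]):
    """Reapply stored rules only where a source value actually changed."""
    refreshed, changed = [], []
    for op in repository:
        current = new_versions.get(op.key, op.source_values)
        if current == op.source_values:
            refreshed.append(op)  # no source update: keep the record untouched
        else:
            refreshed.append(fuse(op.key, op.rule, current))
            changed.append(op.key)
    return refreshed, changed


# Initial fusion of two attributes from two illustrative sources, A and B.
repository = [
    fuse(("p1", "name"), "longest", {"A": "J. Smith", "B": "John Smith"}),
    fuse(("p1", "city"), "first", {"A": "Curitiba", "B": "Sao Carlos"}),
]

# A new version of source A changes only the name.
new_versions = {("p1", "name"): {"A": "Jonathan Smith", "B": "John Smith"}}
refreshed, changed = refresh(repository, new_versions)
```

In this sketch only the `name` attribute is refused; the `city` record is carried over unchanged. Cases in which a stored rule can no longer be reapplied, and the user has to step in, are left out.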

Author information

Authors and Affiliations

Universidade Federal do Paraná, Curitiba, PR, Brazil
Carmem Satie Hara
Universidade de São Paulo, São Carlos, SP, Brazil
Cristina Dutra de Aguiar Ciferri
Universidade Federal de São Carlos, São Carlos, SP, Brazil
Ricardo Rodrigues Ciferri

Editor information

Editors and Affiliations

Department of Computer and Information Science, University of Pennsylvania, 3330 Walnut St., 19104, Philadelphia, PA, USA
Val Tannen
School of Computing, National University of Singapore, 13 Computing Drive, 117417, Singapore, Singapore
Limsoon Wong
School of Informatics, The University of Edinburgh, 10 Crichton Street, EH8 9AB, Edinburgh, UK
Leonid Libkin, Wenfei Fan & Michael Fourman
Department of Computer Science, University of California, 1156 High Street, 95064, Santa Cruz, CA, USA
Wang-Chiew Tan

About this chapter

Cite this chapter

Hara, C.S., de Aguiar Ciferri, C.D., Ciferri, R.R. (2013). Incremental Data Fusion Based on Provenance Information. In: Tannen, V., Wong, L., Libkin, L., Fan, W., Tan, WC., Fourman, M. (eds) In Search of Elegance in the Theory and Practice of Computation. Lecture Notes in Computer Science, vol 8000. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41660-6_18

DOI: https://doi.org/10.1007/978-3-642-41660-6_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41659-0
Online ISBN: 978-3-642-41660-6
eBook Packages: Computer Science, Computer Science (R0)
