Why and Where: A Characterization of Data Provenance
With the proliferation of database views and curated databases, the issue of data provenance (where a piece of data came from and the process by which it arrived in the database) is becoming increasingly important, especially in scientific databases, where understanding provenance is crucial to the accuracy and currency of data. In this paper we describe an approach to computing provenance when the data of interest has been created by a database query. We adopt a syntactic approach and present results for a general data model that applies to relational databases as well as to hierarchical data such as XML. A novel aspect of our work is a distinction between “why” provenance (the source data that had some influence on the existence of the data) and “where” provenance (the location(s) in the source databases from which the data was extracted).
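As a rough illustration of the why/where distinction, consider a two-table join. The Python sketch below is not the paper's algorithm or data model; the relations `EMP` and `DEPT`, the function `names_in_funded_depts`, and the dictionary layout are hypothetical names chosen for this example. For each output value it records the source tuples whose presence made the value appear (why) and the source field the value was copied from (where).

```python
from typing import NamedTuple


class Emp(NamedTuple):
    name: str
    dept: str


class Dept(NamedTuple):
    name: str
    budget: int


# Hypothetical source relations; the list index serves as the tuple identifier.
EMP = [Emp("Ann", "R&D"), Emp("Bob", "Ops")]
DEPT = [Dept("R&D", 500), Dept("Ops", 50)]


def names_in_funded_depts(emp, dept, min_budget):
    """Evaluate, with provenance annotations, the query
    SELECT e.name FROM EMP e, DEPT d
    WHERE e.dept = d.name AND d.budget >= min_budget.
    """
    out = {}
    for i, e in enumerate(emp):
        for j, d in enumerate(dept):
            if e.dept == d.name and d.budget >= min_budget:
                entry = out.setdefault(e.name, {"why": [], "where": []})
                # Why: both joined tuples influenced the existence of the output.
                entry["why"].append({("EMP", i), ("DEPT", j)})
                # Where: the output value itself was copied only from EMP[i].name.
                entry["where"].append(("EMP", i, "name"))
    return out


result = names_in_funded_depts(EMP, DEPT, 100)
```

In this sketch the output "Ann" has a why-provenance containing both the `EMP` tuple and the `DEPT` tuple, while its where-provenance points only at the `name` field of the `EMP` tuple, since the `DEPT` tuple is needed for the output to exist but contributes no copied value.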
Supported in part by an Alfred P. Sloan Research Fellowship.
Keywords: Normal Form, Query Language, Edge Label, Output Expression, Derivation Basis