Why and Where: A Characterization of Data Provenance

  • Peter Buneman
  • Sanjeev Khanna
  • Wang-Chiew Tan 
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1973)


With the proliferation of database views and curated databases, the issue of data provenance (where a piece of data came from and the process by which it arrived in the database) is becoming increasingly important, especially in scientific databases where understanding provenance is crucial to the accuracy and currency of data. In this paper we describe an approach to computing provenance when the data of interest has been created by a database query. We adopt a syntactic approach and present results for a general data model that applies to relational databases as well as to hierarchical data such as XML. A novel aspect of our work is a distinction between “why” provenance, which refers to the source data that had some influence on the existence of the data, and “where” provenance, which refers to the location(s) in the source databases from which the data was extracted.
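The distinction between the two kinds of provenance can be illustrated with a toy relational join. The sketch below is only an illustration under stated assumptions: the relations `EMP` and `DEPT`, their contents, the `Loc` type, and the function `join_with_provenance` are all hypothetical, and the code annotates results directly instead of following the paper's syntactic approach. For each output tuple it records a set of source tuples that together produce it (a "why" witness) and, for each output field, the source location the value was copied from ("where").

```python
from typing import NamedTuple


class Loc(NamedTuple):
    """A location in a source relation: (relation name, row id, field name)."""
    relation: str
    row_id: int
    field: str


# Hypothetical source relations, keyed by row id.
EMP = {
    1: {"name": "Alice", "dept": "CS"},
    2: {"name": "Bob", "dept": "Math"},
}
DEPT = {
    10: {"name": "CS", "building": "Levine"},
    11: {"name": "Math", "building": "DRL"},
}


def join_with_provenance(emp, dept):
    """Evaluate: SELECT e.name, d.building FROM emp e, dept d
    WHERE e.dept = d.name, annotating each output tuple."""
    results = []
    for e_id, e in emp.items():
        for d_id, d in dept.items():
            if e["dept"] == d["name"]:
                results.append({
                    "tuple": (e["name"], d["building"]),
                    # Why: source tuples that influenced the existence
                    # of this output tuple.
                    "why": {("emp", e_id), ("dept", d_id)},
                    # Where: the location each output value was copied from.
                    "where": {
                        "name": Loc("emp", e_id, "name"),
                        "building": Loc("dept", d_id, "building"),
                    },
                })
    return results


rows = join_with_provenance(EMP, DEPT)
for row in rows:
    print(row["tuple"], sorted(row["why"]), row["where"]["building"])
```

In this sketch the `emp` tuple appears in the why-provenance of the building value, because without it the output tuple would not exist, yet the where-provenance of that value points only into `dept`, the location it was extracted from.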

Supported in part by an Alfred P. Sloan Research Fellowship.







Copyright information

© Springer-Verlag Berlin Heidelberg 2001

Authors and Affiliations

  • Peter Buneman (1)
  • Sanjeev Khanna (1)
  • Wang-Chiew Tan (1)

  1. Department of Computer and Information Science, University of Pennsylvania, Philadelphia, USA
