A Provenance Model for Manually Curated Data

  • Peter Buneman
  • Adriane Chapman
  • James Cheney
  • Stijn Vansummeren
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4145)


Many curated databases are constructed by scientists integrating various existing data sources “by hand”, that is, by manually entering or copying data from other sources. Capturing provenance in such an environment is a challenging problem, requiring a good model of the process of curation. Existing models of provenance focus on queries/views in databases or computations on the Grid, not updates of databases or Web sites. In this paper we motivate and present a simple model of provenance for manually curated databases and discuss ongoing and future work.


Keywords: Curated Database; Source Database; Target Database; Provenance Information; Very Large Data Base



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Peter Buneman (1)
  • Adriane Chapman (2)
  • James Cheney (1)
  • Stijn Vansummeren (3)
  1. University of Edinburgh
  2. University of Michigan, Ann Arbor
  3. Hasselt University and Transnational University of Limburg, Belgium
