Earth Science Informatics, Volume 3, Issue 3, pp 167–196

A mathematical framework for earth science data provenance tracing

Software Article

Abstract

This paper identifies three distinct data production paradigms for Earth science data, each having its own versioning structure:
  • Climate data record production, used when the data producer’s dominant concern is providing a homogeneous error structure for each data set version, particularly when the data record is expected to cover a long time period

  • Operational data set production, used when the producer must ensure low latency and service continuity with less attention to error homogeneity across the entire record

  • Exploratory production, used for validation or research in which the producer decides which processes to apply by interacting with the data. In this paradigm, there may not be a common versioning structure from one production episode to another

This paper then develops a mathematical framework for three provenance tracing activities that are important in long-term preservation of Earth science data:
  • tracing the history of data production that created an item of Earth science data, with particular attention to the versioning structure of the data collections

  • tracing the history of custody for an item

  • tracing the history of Intellectual Property Rights transfers for an item

Each of these activities has its own type of Directed Acyclic Graph (DAG) underlying a particular kind of provenance. Provenance tracing is equivalent to performing a Breadth First Search on the appropriate DAG.
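
As a minimal sketch of the search described above, the following Python fragment traces the ancestors of one item breadth-first. The adjacency representation (a dictionary mapping each item to the items it was directly derived from), the function name `trace_provenance`, and all node names are illustrative assumptions, not notation taken from the paper.

```python
from collections import deque


def trace_provenance(parents, start):
    """Return every ancestor of `start`, in breadth-first order.

    `parents` maps a node to the list of nodes it was directly derived
    from (hypothetical representation; edges point back in time).
    """
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        # Nodes with no recorded predecessors are the original sources.
        for parent in parents.get(node, []):
            if parent not in visited:  # a shared ancestor is reported once
                visited.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


# Hypothetical production-history DAG for a single data item.
production_dag = {
    "monthly_mean_v2": ["daily_flux_v2"],
    "daily_flux_v2": ["calibrated_radiance_v2", "scene_id_v1"],
    "calibrated_radiance_v2": ["raw_counts", "calibration_coeffs_v2"],
    "scene_id_v1": ["raw_counts"],
}

history = trace_provenance(production_dag, "monthly_mean_v2")
print(history)
```

The same function could in principle be applied to a custody graph or an Intellectual Property Rights transfer graph by supplying a different `parents` mapping; only the meaning of the edges changes.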

Keywords

Data provenance; Data production paradigms; Production history provenance; Provenance of custodianship; Provenance of intellectual property rights

Notes

Acknowledgements

The author is deeply grateful to the reviewers of this paper for helping him to remove a number of misconceptions and to clarify the writing.

Copyright information

© Springer-Verlag 2010

Authors and Affiliations

  1. Asheville, USA
