A mathematical framework for earth science data provenance tracing
Abstract
- Climate data record production, used when the data producer’s dominant concern is providing a homogeneous error structure for each data set version, particularly when the data record is expected to cover a long time period
- Operational data set production, used when the producer must ensure low latency and service continuity with less attention to error homogeneity across the entire record
- Exploratory production, used for validation or research in which the producer decides which processes to apply by interacting with the data. In this paradigm, there may not be a common versioning structure from one production episode to another

- tracing the history of data production that created an item of Earth science data, with particular attention to the versioning structure of the data collections
- tracing the history of custody for an item
- tracing the history of Intellectual Property Rights transfers for an item
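
As an illustration of the first kind of tracing listed above, production history can be pictured as a backward walk over a directed graph that links each data item to the process that created it and to the inputs that process consumed. The following is only a minimal sketch under that assumption, not the paper's own formalism; the item names (such as `L3_v2`), the `PRODUCED_BY` table, and the function `trace_production_history` are hypothetical and chosen purely for illustration.

```python
from collections import deque

# Hypothetical production records (an assumption for illustration):
# each output item maps to the process that created it and the
# input items that process consumed. Version tags are part of the names.
PRODUCED_BY = {
    "L3_v2": ("grid_v2", ["L2_v2"]),
    "L2_v2": ("retrieve_v2", ["L1_v1", "ancillary_v1"]),
    "L1_v1": ("calibrate_v1", ["L0_raw"]),
}


def trace_production_history(item, produced_by):
    """Return (process, output, inputs) steps leading to `item`, nearest first."""
    history = []
    seen = {item}
    queue = deque([item])
    while queue:
        current = queue.popleft()
        if current not in produced_by:
            continue  # source item: no recorded producing process
        process, inputs = produced_by[current]
        history.append((process, current, list(inputs)))
        for parent in inputs:
            if parent not in seen:  # visit each ancestor only once
                seen.add(parent)
                queue.append(parent)
    return history


history = trace_production_history("L3_v2", PRODUCED_BY)
for process, output, inputs in history:
    print(f"{output} <- {process}({', '.join(inputs)})")
```

A breadth-first walk is used here so that the steps closest to the requested item are reported first; custody or rights-transfer histories could presumably be recorded in a similar table keyed by item, though that is not shown.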
Keywords
Data provenance; Data production paradigms; Production history provenance; Provenance of custodianship; Provenance of intellectual property rights
Notes
Acknowledgements
The author is deeply grateful to the reviewers of this paper for helping him to remove a number of misconceptions and to clarify the writing.