Encyclopedia of Database Systems

Living Edition
| Editors: Ling Liu, M. Tamer Özsu

Provenance and Reproducibility

  • Fernando Chirigati
  • Juliana Freire
Living reference work entry
DOI: https://doi.org/10.1007/978-1-4899-7993-3_80747-1

Keywords

Source Code · Computational Experiment · Computational Step · Operating System Level · Reproducible Research

Synonyms

Definition

A computational experiment composed of a sequence of steps S created at time T, on environment (hardware and operating system) E, using data D is reproducible if it can be executed with a sequence of steps S′ (modified from or equal to S) at time T′ > T, on environment E′ (potentially different from E), using data D′ that is similar to (or the same as) D, with consistent results [5]. Replication is a special case of reproducibility where S′ = S and D′ = D. While there is substantial disagreement on how to define reproducibility [1], particularly across different domains, this entry focuses on computational reproducibility, i.e., reproducibility for computational experiments or processes.
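
For readers who prefer a symbolic statement, the definition can be condensed as in the sketch below; the operator exec and the relation ≈ (standing for "similar data" and "consistent results") are notational conveniences introduced here for illustration and are not part of the original definition.

    % A compact restatement of the prose definition above.
    \[
      \mathrm{exec}(S, D, E, T)\ \text{is reproducible} \iff
      \exists\, S',\ D' \approx D,\ E',\ T' > T \ \text{such that}\
      \mathrm{exec}(S', D', E', T') \approx \mathrm{exec}(S, D, E, T)
    \]
    % Replication is the special case where S' = S and D' = D.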

The information needed to reproduce an experiment can be obtained from its provenance: the details of how the experiment was carried out and the results it derived. For computational experiments, provenance can be captured systematically and transparently. In addition to enabling reproducibility, provenance allows results to be verified, i.e., one can determine whether the experiment solves the problem it claims to solve, which helps identify common issues and biases in research, such as p-hacking. Provenance also describes the chain of reasoning used to derive the results, which provides additional insight when the findings are reproduced or extended in other experiments.

Historical Background

The concept of reproducible research for computational experiments is similar in nature to the ideas on literate programming introduced by Knuth [6]. Claerbout and his colleagues coined the term reproducible research [3] and pioneered the adoption of reproducibility in geophysics research.

The ability to reproduce experiments is a requirement of the scientific process, and the benefits of reproducibility have long been known. Readers (consumers) can validate and compare methods published in the literature, as well as build upon previous work. In fact, reproducibility may increase the value of subsequent work: if a better method is developed and the original paper is reproducible, researchers can quantify how their own ideas compare with the claims made in the original publication. Authors (producers) also benefit in multiple ways. Making an experiment reproducible forces the researcher to document execution pathways, which in turn enables the pathways to be analyzed (and audited). It also helps newcomers get acquainted with the problem and the tools used. Furthermore, reproducibility forces portability, which simplifies the dissemination of results. Finally, there is preliminary evidence that reproducibility increases impact, visibility, and research quality [10, 11] and helps counter self-deception [9].

Although reproducibility is a standard in natural science and in mathematics, where results are accompanied by formal proofs, it has not been widely practiced for results backed by computational experiments. Scientific papers published in conferences and journals present a large number of tables, plots, and polished figures that summarize the obtained results but only loosely describe the steps taken to derive them. Not only can the methods and their implementation be complex, but their configuration may require setting many parameters. Consequently, reproducing the results from scratch is time-consuming, error-prone, and sometimes impossible [8]. Indeed, studies have shown that the number of scientific publications that can be effectively reproduced is far from ideal [2, 4, 7, 11].

The credibility crisis in computational science has led to many efforts aimed at making the publication of reproducible results the norm. Funding agencies in the USA, top-tier conferences and journals, and academic institutions have started to encourage, and sometimes require, authors to create reproducible experiments, i.e., to publish the data and code needed to reproduce the results presented in publications. Many techniques and open-source tools have been developed to facilitate the practice of reproducible research.

Scientific Fundamentals

Provenance Components for Reproducibility. To enable reproducibility, provenance should be executable and should include three components: the input data, the specification of experiment, and the specification of environment.

The input data includes the data D used in the experiment, as well as parameters and additional variables, which must be included either in extension (e.g., as a text file) or in intension (e.g., as a script that generates the data). It may also be useful to include intermediate and output data, in case some steps of the experiment cannot be fully reproduced (e.g., data generated by third-party software that cannot be openly distributed). The specification of experiment is a description of the experiment and of the set of computational steps S that were executed. Examples of specifications include scripts that glue together the different pieces of an experiment and workflows created by authors using scientific workflow systems. The specification of environment consists of information about the computational environment E where the experiment was originally created and executed, including the operating system (OS), the hardware architecture, and library dependencies.
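
As a minimal sketch of how these three components might be recorded together, the following Python snippet builds a small provenance manifest for a hypothetical experiment; all concrete file names, step names, and parameters are invented for this example and do not correspond to any particular tool.

    import json
    import platform
    import sys

    # Minimal provenance manifest grouping the three components discussed above.
    # Every concrete value (scripts, files, parameters) is hypothetical.
    manifest = {
        "input_data": {
            "files": ["data/measurements.csv"],          # data D, included "in extension"
            "generator": "scripts/sample_data.py",       # or "in intension", via a generating script
            "parameters": {"threshold": 0.05, "seed": 42},
        },
        "specification_of_experiment": {
            "steps": ["scripts/clean.py", "scripts/analyze.py", "scripts/plot.py"],  # the steps S
        },
        "specification_of_environment": {
            "operating_system": platform.platform(),     # OS and version
            "architecture": platform.machine(),          # hardware architecture
            "python_version": sys.version.split()[0],    # one runtime dependency of many
        },
    }

    with open("provenance.json", "w") as f:
        json.dump(manifest, f, indent=2)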

Gathering and managing this provenance is challenging because experiments tend to use different software systems, library dependencies, programming languages, and platforms. In addition, the specification of environment may not be trivial to capture, as operating systems and dependencies are often complex and have many stateful components. Tools are available that simplify this process (see below).
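
For the library-dependency part of the environment, a rough first approximation can be obtained programmatically. The sketch below records the installed Python distributions using the standard importlib.metadata module; it covers only one slice of the environment E described above and ignores system packages, kernel state, and other stateful components.

    from importlib import metadata

    # Record name and version of every installed Python distribution.
    # This captures only part of the environment specification: system libraries,
    # OS configuration, and other stateful components are not covered.
    dependencies = sorted(
        (dist.metadata["Name"] or "", dist.version) for dist in metadata.distributions()
    )

    with open("environment.txt", "w") as f:
        for name, version in dependencies:
            f.write(f"{name}=={version}\n")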

Granularity. Provenance can be captured at different levels of granularity. OS-based capture works at the operating system level, i.e., it captures system calls, kernel information, and computational processes. It therefore yields a detailed description of the specification of environment. However, because this provenance is fine-grained, it may be hard to reconcile it with the different computational steps of the experiment and to provide a high-level, human-readable representation. Workflow-based capture uses a scientific workflow system to track and store the computational steps (wrapped in modules) and their data dependencies, but information about the environment is rarely gathered. Code-instrumented capture entails instrumenting the source code, automatically or manually through annotations, to capture the provenance components. As with workflows, the input data and specification of experiment are captured, but not the specification of environment.
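
To make the code-instrumented option more concrete, the sketch below shows one possible way to instrument Python functions with a decorator that records each step's inputs, output, and timing. This decorator is purely illustrative; tools such as noWorkflow capture far more detail and work differently.

    import functools
    import json
    import time

    PROVENANCE_LOG = []  # in-memory provenance records for the captured steps

    def capture_step(func):
        """Record inputs, output, and duration of a computational step."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = func(*args, **kwargs)
            PROVENANCE_LOG.append({
                "step": func.__name__,
                "inputs": {"args": repr(args), "kwargs": repr(kwargs)},
                "output": repr(result),
                "duration_s": round(time.time() - start, 6),
            })
            return result
        return wrapper

    @capture_step
    def normalize(values):
        total = sum(values)
        return [v / total for v in values]

    normalize([1, 2, 3])
    print(json.dumps(PROVENANCE_LOG, indent=2))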

Axes of Reproducibility. Computational reproducibility may come in different levels depending on how much provenance information is captured. Full reproducibility, although desirable, may be hard or impossible to attain. To characterize these different levels, three axes of reproducibility were introduced [5]: transparency level, portability, and coverage.

The transparency level considers how much information with respect to data and computational steps is available for an experiment. This relates to the capture of input data and specification of experiment. The default level today is to provide a document with a set of figures (and captions) that represent the computational results. The level increases by providing (i) the input data D and intermediate results derived by the experiment, (ii) the binaries used to generate the figures included in the document, (iii) a high-level, human-readable description of the experiment (e.g., a workflow), and (iv) the source code, which allows further exploration of the experiment.

Portability is related to the ability to run an experiment in different environments and depends on the available specification of the environment. Computational results may be reproduced (i) on the original environment E, (ii) on similar environments (i.e., compatible operating system but different machines), or (iii) on different environments (i.e., different operating systems and machines).

Coverage takes into account how much of the experiment can be reproduced, i.e., whether the experiment can be (i) partially or (ii) fully reproduced. Many experiments cannot be fully reproduced, including experiments that rely on data derived by third-party Web services or special hardware, or that involve nondeterministic computational processes. Such experiments can sometimes be partially reproduced: a sequence of steps S′ ⊂ S is often available. For example, if the data derived by special hardware is made available, the data derivation step cannot be reproduced, but the subsequent analysis that depends on these data can still be performed. The level of coverage is higher when more detailed provenance (for all the components) is captured.
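
The special-hardware example can be sketched in code: if the acquisition step cannot be rerun, its archived output is loaded instead, and the downstream analysis still executes. The file name and functions here are hypothetical.

    import os

    ARCHIVED_OUTPUT = "data/instrument_output.txt"  # hypothetical archived data file

    def acquire_from_instrument():
        # Data derivation step that needs special hardware; it cannot be rerun here.
        raise RuntimeError("special hardware not available in this environment")

    def analyze(raw):
        # Downstream analysis step that remains reproducible.
        return sum(raw) / len(raw)

    if os.path.exists(ARCHIVED_OUTPUT):
        # Partial reproduction (S' subset of S): reuse the archived output of the hardware step.
        with open(ARCHIVED_OUTPUT) as f:
            raw = [float(line) for line in f]
        print("mean:", analyze(raw))
    else:
        # Without the archived data, this part of the experiment cannot be reproduced at all.
        print("archived instrument data not found; the analysis step cannot be rerun")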

Note that these axes, although related, are independent of each other. An experiment may have high coverage (e.g., data for all the steps is available) but low transparency (e.g., no source code or binaries are available), or vice versa. Moreover, even with high coverage and transparency, the experiment may still not be portable (e.g., if only binaries are made available, it is hard to run the experiment on other operating systems).

Reproducibility Modes. One can plan for reproducibility and capture provenance as an experiment is carried out, or attain it after the fact, once the experiment is completed. The former enables the capture of the entire history of the experiment, which helps in understanding the chain of reasoning behind all the steps. In the latter case, the nature of the experiment determines how provenance can be captured for reproducibility purposes. If the source code is available, it is possible to instrument the code to capture the required information. If only binaries are provided, provenance can be captured at the OS level.
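
When only binaries are available, OS-level capture typically relies on system-call tracing. The sketch below shells out to the Linux strace utility to log the files and programs an opaque executable touches; the executable path is a placeholder, and production tools such as ReproZip (which builds on ptrace) capture and interpret much more than this.

    import subprocess

    # Trace file- and process-related system calls of a binary-only experiment
    # (Linux only; requires strace). "./experiment" is a placeholder executable.
    subprocess.run(
        ["strace", "-f", "-e", "trace=open,openat,execve", "-o", "trace.log", "./experiment"],
        check=True,
    )

    # trace.log now lists the files and programs the binary used, which is one
    # starting point for reconstructing its specification of environment.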

Key Applications

Scientific Workflow Systems (SWS). In these systems, provenance capture is tied to the workflow definition (workflow-based capture). They capture and store the input data as well as the specification of experiment as a workflow, together with a description of the modules that compose this workflow. However, the specification of environment is rarely captured: workflows usually do not include in their bundles precise information about the computational environment (e.g., library dependencies), which hampers portability. Coverage will depend on how much of the experiment is represented in the workflow. The level of transparency often depends on how the underlying computational steps are wrapped for integration with SWS. Since modules are black boxes, the more information is abstracted within a module, the lower the transparency is. Examples of workflow systems include VisTrails, Taverna, Kepler, and Galaxy.

Programming Tools and Software Packages. Tools are available for standard programming environments that can gather provenance without requiring researchers to port (or wrap) their experiments into other systems, such as workflow systems. Some of these tools instrument the source code to capture provenance (code-instrumented capture). The captured provenance includes the input data and the specification of experiment, but it often lacks full details of the specification of environment. Examples include Sumatra, noWorkflow, and ProvenanceCurious.

Literate programming tools are another class of tools relevant to this category: they interleave source code and text in natural language, which allows users to integrate code fragments with graphical, numerical, and textual output. Like the previous tools, they often capture the input data and the specification of experiment, but portability cannot be entirely guaranteed. Sweave and Dexy are examples of such tools. Finally, interactive notebooks are closely related to literate programming tools, but instead of requiring a separate compilation step, they are based on interactive worksheets. Examples include Jupyter and Sage.

Packaging Applications. These tools create executable compendiums that allow experiments to be reproduced in environments E′ potentially different from the original environment E, usually with high coverage, transparency, and portability. Linux-based packaging tools, such as CDE and ReproZip, use OS-based capture to obtain all three provenance components of an experiment. Using the captured information, these tools create an executable package for the experiment that can be used to reproduce the original results.
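
As an illustration of the trace-and-pack workflow such tools implement, the Python sketch below drives ReproZip's command-line interface; the experiment entry point and package name are placeholders, and the two invocations follow ReproZip's documented trace and pack commands.

    import subprocess

    # Step 1: run the experiment under ReproZip's tracer (OS-based capture).
    # "./run_experiment.sh" is a placeholder for the experiment's entry point.
    subprocess.run(["reprozip", "trace", "./run_experiment.sh"], check=True)

    # Step 2: bundle the traced files, dependencies, and metadata into a package.
    subprocess.run(["reprozip", "pack", "experiment.rpz"], check=True)

    # The resulting experiment.rpz can later be unpacked and rerun on a different
    # environment E' with the companion reprounzip tool.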

Executable Documents. A number of tools have been developed to aid researchers in creating executable documents, which can be defined as provenance-rich files that, in addition to text, also include the computational objects (e.g., code and data) used to generate the experimental results of a publication. They can blend static and dynamic content, and are often represented as HTML pages or enhanced PDF files, allowing readers to interact with and reproduce the findings described by the authors. The creation of executable documents is a feature of many tools, such as VisTrails, Galaxy, and most literate programming tools and interactive notebooks. Other tools (e.g., Janiform) focus entirely on producing these documents: they do not capture provenance themselves but encapsulate it within the documents.

Repositories. Repositories have been created to host, store, and share computational experiments. Although provenance capture is often not supported, these repositories aim at preserving the provenance components to allow experiments to be maintained and reproduced long after they were created. Repositories such as myExperiment, Dataverse, DataONE, and figshare support the archival of experiments that can be downloaded at any time by researchers, while others such as crowdLabs also allow results to be reproduced through a Web browser.

URL to Code

Cross-References

Recommended Reading

  1. Baker M. Muddled meanings hamper efforts to fix reproducibility crisis. Nature News & Comment. 14 Jun 2016.
  2. Bonnet P, Manegold S, Bjørling M, Cao W, Gonzalez J, Granados J, Hall N, Idreos S, Ivanova M, Johnson R, Koop D, Kraska T, Müller R, Olteanu D, Papotti P, Reilly C, Tsirogiannis D, Yu C, Freire J, Shasha D. Repeatability and workability evaluation of SIGMOD 2011. SIGMOD Rec. 2011;40(2):45–8.
  3. Claerbout J, Karrenbach M. Electronic documents give reproducible research a new meaning. In: Proceedings of the 62nd Annual International Meeting of the Society of Exploration Geophysics; 1992. p. 601–4.
  4. Collberg C, Proebsting T, Warren AM. Repeatability and benefaction in computer systems research. Technical report TR 14-04, University of Arizona; 2015.
  5. Freire J, Bonnet P, Shasha D. Computational reproducibility: state-of-the-art, challenges, and database research opportunities. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12. New York: ACM; 2012. p. 593–6.
  6. Knuth DE. Literate programming. Comput J. 1984;27(2):97–111.
  7. Kovacevic J. How to encourage and publish reproducible research. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2007. vol. 4, p. IV-1273–6.
  8. LeVeque R. Python tools for reproducible research on hyperbolic problems. Comput Sci Eng. 2009;11(1):19–27.
  9. Nuzzo R. How scientists fool themselves, and how they can stop. Nature. 2015;526(7572):182–5.
  10. Piwowar HA, Day RS, Fridsma DB. Sharing detailed research data is associated with increased citation rate. PLoS One. 2007;2(3):e308.
  11. Vandewalle P, Kovacevic J, Vetterli M. Reproducible research in signal processing – what, why, and how. IEEE Signal Process Mag. 2009;26(3):37–47.

Copyright information

© Springer Science+Business Media LLC 2017

Authors and Affiliations

  1. NYU Tandon School of Engineering, Brooklyn, USA
  2. NYU Tandon School of Engineering, Brooklyn, USA
  3. NYU Center for Data Science, New York, USA

Section editors and affiliations

  • Juliana Freire
  1. University of Utah, Salt Lake City, USA