Issues in Reproducible Simulation Research
In recent years, serious concerns have arisen about reproducibility in science. Estimates of the cost of irreproducible preclinical studies range from 28 billion USD per year in the USA alone (Freedman et al. in PLoS Biol 13(6):e1002165, 2015) to over 200 billion USD per year worldwide (Chalmers and Glasziou in Lancet 374:86–89, 2009). The situation in the social sciences is not very different: Reproducibility in psychological research, for example, has been estimated to be below 50% as well (Open Science Collaboration in Science 349:6251, 2015). Less well studied is the issue of reproducibility of simulation research. A few replication studies of agent-based models, however, suggest the problem for computational modeling may be more severe than for laboratory experiments (Wilensky and Rand in JASSS 10(4):2, 2007; Donkin et al. in Environ Model Softw 92:142–151, 2017; Bajracharya and Duboz in: Proceedings of the symposium on theory of modeling and simulation—DEVS integrative M&S symposium, pp 6–11, 2013). In this perspective, we discuss problems of reproducibility in agent-based simulations of life and social science problems, drawing on best practices research in computer science and in wet-lab experiment design and execution to suggest some ways to improve simulation research practice.
Keywords: Agent-based models · Simulation reproducibility · Validation · Test-driven development · Version control · Computational lab notebook
In recent years, serious concerns have arisen about reproducibility in science. Sensational reports from Amgen (Begley and Ellis 2012) and Bayer (Prinz et al. 2011) found that 47 of 53 and 52 of 67 preclinical studies, respectively, published in high-profile journals were not reproducible. Even the more conservative estimates of problematic research in biomedicine place the rate of reproducibility at less than 50% (Freedman et al. 2015). Moreover, estimates of the cost of irreproducible preclinical studies range from 28 billion USD per year in the USA alone (Freedman et al. 2015) to over 200 billion USD per year worldwide (Chalmers and Glasziou 2009). The situation in the social sciences is not very different: Reproducibility in psychological research, for example, has been estimated to be below 50% as well (Open Science Collaboration 2015).
Less well studied is the issue of reproducibility of simulation research. As computational models become integrated into biological research and as techniques such as machine learning are adopted for drug discovery, the reliability of computational results must be investigated.
The small body of replication studies of agent-based models seems to suggest an even worse problem with computational reproducibility. Wilensky and Rand (2007) draw close parallels between computational model building and the process of experimental science as they detail the challenges of replicating an agent-based model (ABM) from published literature. Their successful replication required extensive personal interaction with the original model's authors. Since that effort, the "Overview, Design concepts, and Details" (ODD) protocol (Grimm et al. 2005, 2010; Railsback and Grimm 2012) has provided a clear and consistent framework for model reporting. Yet as Donkin et al. (2017) and Bajracharya and Duboz (2013) discovered, even a solid ODD protocol may not be sufficient for simulation replication. Each of their studies implemented a single ABM in distinct computational environments. In each case, serious problems with reproducibility were found, even when a single team built the same conceptual model into different software implementations.
Everyone trusts the experiment but the wind tunnel expert; no one trusts the simulation but the computational fluids expert!
2 Differences Between the Sciences and Engineering
In this perspective, we focus on the problem of reproducible stochastic simulations, and especially ABM simulations, in the context of life and social science applications. Important similarities and differences between these fields and the physical sciences and engineering deserve note. The key distinguishing factor is between-subject variability, which complicates living organisms (and societies) at nearly every scale of interest. Within-subject variation is also fundamentally different: the multi-scale nature of organisms creates variation whose stochastic characterization is more challenging than that of engineering components. Turbulent fluid flow, optical propagation, and combustion are physical and chemical problems whose variability approaches that of life and social science systems, but, with some exceptions, physical systems can often be reduced to components that operate with a high degree of certainty, determinism, and/or uniformity. Living systems are very difficult to reduce to such constituents, a problem that leads many scientists to prefer simulation technologies like ABMs (An et al. 2009; Railsback and Grimm 2012).
3 The Language of Simulation Reproducibility
The term “reproducibility” itself requires some refinement of definition within contexts of experimental and simulation research. For simulations, Axtell et al. (1996) suggested three levels of replication standard: numerical identity, in which simulation comparisons produce numerically identical outcomes; distributional equivalence, in which comparisons demonstrate statistical similarity in repeated simulation outcomes; and relational alignment, in which the results show qualitatively similar relationships between inputs/parameters and outcomes. Generally speaking, for ABMs that use stochastic simulation, numerical identity is too much to expect, and distributional equivalence and relational alignment are the replication standards of primary interest.
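The distributional-equivalence standard can be made operational with a simple statistical comparison of repeated runs. The sketch below is illustrative only: the toy random-walk "ABM", the sample sizes, and the use of a two-sample Kolmogorov–Smirnov test with its asymptotic 5% critical value are our own choices, not drawn from the cited studies.

```python
import random
import statistics
from math import sqrt

def run_abm(seed, n_agents=100, n_steps=50):
    """Toy stochastic 'ABM': each agent takes a +/-1 random walk;
    the reported outcome is the mean final position."""
    rng = random.Random(seed)
    positions = [0.0] * n_agents
    for _ in range(n_steps):
        positions = [p + rng.choice([-1.0, 1.0]) for p in positions]
    return statistics.mean(positions)

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the two empirical CDFs."""
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in sorted(a + b))

# Thirty replicate runs from each of two implementations
# (here, for brevity, the same toy implementation twice).
sample_a = [run_abm(seed) for seed in range(30)]
sample_b = [run_abm(seed) for seed in range(1000, 1030)]

d = ks_statistic(sample_a, sample_b)
# Asymptotic 5% critical value for the two-sample KS test.
n, m = len(sample_a), len(sample_b)
critical = 1.36 * sqrt((n + m) / (n * m))
distributionally_equivalent = d < critical
```

In a real replication study, `sample_a` and `sample_b` would come from two independent implementations of the same conceptual model; failing to reject equality of the output distributions is evidence for distributional equivalence, while relational alignment would instead compare the signs and shapes of input–output relationships.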
Beyond setting criteria for attaining reproducibility, we must also clarify our language about the ABM itself. Following Wilensky and Rand (2007), we say that an ABM is a dynamic simulation of a population of heterogeneous agents that obey specific rules. A conceptual model is a textual, mathematical, or diagrammatic description (or a combination of these) of the agent characterization and the processes of rule-based interaction in an ABM. An implementation or operationalization is a formalization of a conceptual model into an executable computational form from which numerical output can be derived. Typically, implementation occurs in software.
With notions of reproducibility and model system in place, we turn to the challenges.
4 Where the Problems Lie

Failures to reproduce a simulation result can originate in several distinct sources:
Time: A model reconstructed or even rerun at a different time;
Hardware: The computational hardware on which an ABM is implemented;
Languages: The software environment or programming language used to construct an implementation;
Toolkits: Programming libraries used in conjunction with language to construct an implementation;
Algorithms: Underlying mathematical processes used in conceptual models and implementations;
Authors: Individuals building ABMs, conceptual models, and implementations; and
Translation: Moving from a conceptual model to an implementation.
Time and hardware, by themselves, are less likely to be culprits in failures to reproduce simulation results. The problems found by Donkin et al. (2017) and Bajracharya and Duboz (2013) appear to center on languages and toolkits (and perhaps the algorithms on which toolkits are built). The Donkin study clearly points out the challenges of comparing NetLogo, with its high-level structures, with Repast, which involves lower-level programming. Wilensky and Rand (2007) discuss in detail author and translation issues in their replication effort. All of these studies cite the utility of source code availability for replicators to create as high-fidelity a facsimile as possible.
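Time, at least, is straightforward to control. When all randomness flows through one explicitly seeded generator, a rerun of the same code on the same software stack reproduces its output exactly, meeting even the numerical-identity standard. A minimal sketch (the toy dropout model is invented for illustration):

```python
import random

def simulate(seed, n_agents=50, n_steps=20, p_dropout=0.05):
    """Toy stochastic simulation: at each step every active agent
    drops out with probability p_dropout; returns the trajectory of
    active-agent counts."""
    rng = random.Random(seed)   # all randomness flows through one seeded RNG
    active = [True] * n_agents
    trajectory = []
    for _ in range(n_steps):
        active = [a and rng.random() >= p_dropout for a in active]
        trajectory.append(sum(active))
    return trajectory

# Reruns with the same seed are numerically identical.
assert simulate(seed=42) == simulate(seed=42)
```

Hidden global RNG state, unseeded parallel workers, and floating-point differences across hardware or library versions are the usual ways this guarantee breaks, which is why recording the seed alongside the software environment matters.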
5 Moving Forward
Two recommendations already well described in the literature involve the publication of both an ODD protocol and original source code. These are well documented in the references and their own citations. Beyond these two recommendations, we see practices in the software development and preclinical experimental research communities that may also have positive impacts on simulation reproducibility.
5.1 View the Simulation as an Experimental System
An et al. (2009, 2017) advocate that an ABM is more productively studied as a system in and of itself, rather than as a model of a system. In much the way model organisms are used to investigate problems of human health, the ABM is a "middleware" object existing between traditional mathematical models and the real system of interest. Validation of agent-based models involves a number of steps. Statistical experimental design can inform thinking about simulation development, analysis, and validation (Santner et al. 2003).
5.2 Validate in Multiple Stages

North and Macal describe a multi-stage validation process for ABMs, which can be summarized as seven questions:
Requirements Validation: Have the model requirements been properly specified for the problem at hand?
Data Validation: Have the data used to calibrate the model been properly collected and verified?
Face Validation: Do the model assumptions and outputs appear reasonable?
Process Validation: Do the steps in model execution, agent decision, and computational flow correspond to real-world processes?
Theory Validation: Does the model make valid use of the theory on which it is based?
Agent Validation: Do agent behaviors correspond to real individual behaviors?
Output Validation: Do the model outputs agree with observed data?
The first six steps here connect closely to specifications required in the ODD protocol. The level of detail applied in these steps will certainly impact the ability of future researchers to use (and reproduce!) the model. The seventh step requires careful consideration in terms of (a) the goals of the model building exercise, (b) the replication standards that would arise in reproduction efforts, and (c) the software development process.
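Several of these stages can be encoded as executable checks that travel with the model. In the sketch below (the toy epidemic model, its parameters, and the "observed" bounds are all invented placeholders), face validation asserts that outputs stay in plausible ranges, and output validation compares a simulated summary statistic against externally supplied bounds:

```python
import random
import statistics

def epidemic_abm(seed, n_agents=200, n_steps=30, p_transmit=0.5):
    """Toy epidemic ABM: one seed case; at each step, each susceptible
    agent is infected with probability p_transmit times the currently
    infected fraction. Returns the final attack rate."""
    rng = random.Random(seed)
    infected = [False] * n_agents
    infected[0] = True
    for _ in range(n_steps):
        frac = sum(infected) / n_agents
        infected = [s or rng.random() < p_transmit * frac for s in infected]
    return sum(infected) / n_agents

attack_rates = [epidemic_abm(seed) for seed in range(20)]

# Face validation: an attack rate is a fraction, and the seed case
# guarantees it is strictly positive.
assert all(0.0 < r <= 1.0 for r in attack_rates)

# Output validation (illustrative): does the mean simulated attack
# rate fall inside bounds observed for the real system? The bounds
# below are placeholders, not real data.
observed_low, observed_high = 0.0, 1.0
mean_rate = statistics.mean(attack_rates)
output_valid = observed_low <= mean_rate <= observed_high
```

Checks like these do not replace the judgment-driven stages (requirements, theory, face validation by domain experts), but they make the output-facing stages repeatable by any future researcher who reruns the model.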
5.3 Don’t Document the Code: Code the Documentation
Since the computational simulations in which we are most interested involve software, we see clean code as a crucial step toward reproducibility. Robert C. Martin, in Clean Code (2008), notes the importance of code readability, of the structure of functions, methods, and modules, and of meaningful naming. Clean code would ideally be the readable software implementation of a well-constructed ODD protocol.
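A small before/after sketch shows what this means in practice; the movement rule and all names here are invented for illustration, not taken from any cited model:

```python
from dataclasses import dataclass

@dataclass
class Agent:
    position: float
    energy: float

# Hard to read: opaque names, unexplained magic numbers.
def f(a, t):
    for x in a:
        if x.energy < 0.2:
            x.position += t * 1.5

# Clean: every name states its intent; constants are documented.
ENERGY_STARVATION_THRESHOLD = 0.2   # below this, an agent forages
FORAGING_SPEED = 1.5                # distance units per time step

def move_starving_agents(agents, time_step):
    """Advance every agent whose energy is below the starvation
    threshold (a deliberately simplified movement rule)."""
    for agent in agents:
        if agent.energy < ENERGY_STARVATION_THRESHOLD:
            agent.position += time_step * FORAGING_SPEED

agents = [Agent(position=0.0, energy=0.1), Agent(position=0.0, energy=0.9)]
move_starving_agents(agents, time_step=1.0)
# agents[0] moves to 1.5; agents[1], with ample energy, stays at 0.0
```

Both functions compute the same thing, but only the second one lets a replicator check the code against the ODD description of the rule without reverse-engineering it.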
5.4 Write Code to the Tests
The coding philosophy of test-driven development (TDD) translates a modeling component into a coding requirement and from there into specific test cases that must be passed by a code module (Beck 2003; Madeyski 2010; Mäkinen and Munch 2014). Code is written to pass the tests that model the requirements.
Code written in the TDD paradigm tends to be very modular and well aligned with clean code design principles. As such, it can reinforce the North and Macal validation process: individual code module tests can be constructed to validate execution steps, agent decisions and behaviors, and model outputs. Note, however, that the focus of TDD is more on verification (are you building the thing right?) than on validation (are you building the right thing?) of the model.
5.5 Use a Version Control Repository
Development of a complex simulation, even when a single developer is responsible for all code, requires oversight and eventual dissemination of the code, testing suite, and relevant documents. GitHub (https://github.com) is perhaps the best-known platform for code sharing, version control, and developer collaboration. An example of such a repository is https://github.com/kdahlquist/GRNmap, a gene regulatory network modeling project.
5.6 Keep a Computational Laboratory Notebook
Long a key step in bench and field experimentation, the laboratory notebook contains a record of procedures and results. Detailing the building and running of simulations provides similar benefits to the computational experimentalist. Many institutions have policies concerning laboratory notebooks, but few (if any) have them for computation. The article "Ten Simple Rules for a Computational Biologist's Laboratory Notebook" (Schnell 2015) offers a solid set of record-keeping principles.
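Part of such a notebook can be generated by the simulation harness itself. The sketch below (file name, field names, and the sample values are arbitrary choices for illustration) appends one machine-readable entry per run, recording when the simulation ran, with what seed and parameters, and, when available, which version-control revision of the code produced it:

```python
import json
import subprocess
from datetime import datetime, timezone

def current_git_commit():
    """Return the current commit hash, or None outside a git checkout."""
    try:
        out = subprocess.run(["git", "rev-parse", "HEAD"],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None

def record_run(path, seed, parameters, results_summary):
    """Append one JSON-lines notebook entry for a simulation run."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "git_commit": current_git_commit(),
        "seed": seed,
        "parameters": parameters,
        "results_summary": results_summary,
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

# Illustrative usage with placeholder values.
entry = record_run("runs.jsonl", seed=42,
                   parameters={"n_agents": 100, "n_steps": 50},
                   results_summary={"mean_outcome": 0.37})
```

Automated entries of this kind complement, rather than replace, the narrative notebook: the harness records what was run, while the researcher records why.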
Complaints about recommendations such as these typically center on their time-consuming nature. For “one run and done” computational projects, our experience is that these four suggestions add considerable time and effort to project completion. However, over a multi-year project with multiple contributors (students, postdocs, research associates, etc.), having readable code with coded tests in a structured repository and a record of computational development and experimentation accelerates scientific progress dramatically. Moreover, a simulation model built on such a foundation can have much broader impact beyond the developing laboratory.
This work was partially supported by National Institute on Drug Abuse grant 1R43DA041760-01.
- Bajracharya K, Duboz R (2013) Comparison of three agent-based platforms on the basis of a simple epidemiological model (WIP). In: Proceedings of the symposium on theory of modeling and simulation—DEVS integrative M&S symposium, pp 6–11
- Beck K (2003) Test-driven development: by example. Pearson, Boston
- Freedman LP, Cockburn IM, Simcoe TS (2015) The economics of reproducibility. PLoS Biol 13(6):e1002165. http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002165. Accessed 10 June 2015
- Mäkinen S, Munch J (2014) Effects of test-driven development: a comparative analysis of empirical studies. In: Winkler D, Biffl S, Bergsmann J (eds) Software quality: model-based approaches for advanced software and systems engineering. Springer, Cham
- Martin RC (2008) Clean code: a handbook of agile software craftsmanship. Pearson, Boston
- Smith R (2017) Personal communication
- Wilensky U, Rand W (2007) Making models match: replicating an agent-based model. JASSS 10(4):2