# The Crisis of Reproducibility, the Denominator Problem and the Scientific Role of Multi-scale Modeling

**Part of the following topical collections:**

## Abstract

The “Crisis of Reproducibility” has received considerable attention both within the scientific community and without. While factors associated with scientific culture and practical practice are most often invoked, I propose that the Crisis of Reproducibility is ultimately a failure of generalization with a fundamental scientific basis in the methods used for biomedical research. The Denominator Problem describes how limitations intrinsic to the two primary approaches of biomedical research, clinical studies and preclinical experimental biology, lead to an inability to effectively characterize the full extent of biological heterogeneity, which compromises the task of generalizing acquired knowledge. Drawing on the example of the unifying role of theory in the physical sciences, I propose that multi-scale mathematical and dynamic computational models, when mapped to the modular structure of biological systems, can serve a unifying role as formal representations of what is conserved and similar from one biological context to another. This ability to explicitly describe the generation of heterogeneity from similarity addresses the Denominator Problem and provides a scientific response to the Crisis of Reproducibility.

## Keywords

Crisis of Reproducibility Multi-scale models- 1.
The Crisis of Reproducibility is a by-product of a faulty incentive structure in the current science/academic environment that overly rewards positive results, prompts overstated claims and propagates an “advocacy” mindset that is in opposition to the fundamental role of skepticism in science. The reductio ad absurdum of this position is scientific fraud, which most believe is not widespread; there is a recognition that faulty incentive have a significant role even absent that extreme.

- 2.
The Crisis of Reproducibility is a by-product of sloppy or un-rigorous science, both technically and intellectually, that is further incentivized by #1. Solutions from this branch of opinions gravitate toward the establishment of more rigorous standards for transparency in reporting and publishing methods and results.

While the “societal” factors may certainly play a role in the Crisis of Reproducibility, I have an alternative hypothesis: that the Crisis of Reproducibility is a fundamental epiphenomenon of how biological “science” is carried out, specifically related to a lack of recognition regarding how existing methods do not account for fundamental properties of biological systems. Therefore, I claim that there is a scientific reason for the Crisis of Reproducibility intrinsic to the current process of biomedical research, and that multi-scale mathematical and dynamic computational modeling provides a means of overcoming this deficit.

- 1.
The Translational Dilemma or “Valley of Death” in drug development (Butler 2008), which refers to the inability to reliably translate knowledge obtained at the preclinical experimental level into clinically effective therapeutics.

- 2.
Personalized or Precision Medicine (Collins and Varmus 2015), which purports to account for individual–individual variability in the clinical setting by attempting to identify particular patient-disease characteristics associated with improved responsiveness to existing drugs or other treatment modalities. This process is most notably applied in the area of oncology, where tumor genotypes are correlated with presumptively more effective drug combinations [though it is apparent that, based on existing efficacy and applicability, this approach is still in evolution (Marquart et al. 2018)].

I propose that all these issues arise from the same source: a lack of recognition of the Denominator Problem in biomedical research. *I define the Denominator Problem as to the inability to effectively characterize the* “*denominator*” *of a biosystem being studied, where the* “*denominator*” *is defined as the population distribution of the total possible behavior/state space, as described by whatever metrics chosen, of that biosystem.* The concept of a system’s denominator is critical since it is directly tied to the process of generalization of knowledge, a fundamental goal of Science that aims to formally express what is similar and conserved from one observational context to another. The Denominator Problem arises when attempting to answer the question: “When is what I learn from a subset of all possible outcomes of a system generalizable to the system as a whole?” The process of generalization involves dealing with the Problem of Induction: the ability to make reliable statements about a particular phenomenon based on some sampling of that phenomenon.

*theory*-

*based*approach to science. The other means of addressing the Problem of Induction is through the development and use of statistics. Statistics has evolved as an empirically based means of determining the reliability of generalizing statements, but its application requires knowing the relationship between the empirical sampling and the range of possible phenomena being evaluated, e.g., the denominator set reflecting the space of intended generalization. Therefore, in the absence of theory, achieving the generalizing aim of Science depends upon the reliability of the denominator space of the phenomenon being examined. The use of statistical tools invariably requires an initial assumption about the nature of a system’s underlying total population distribution of observables. When such assumptions have a high degree of confidence, as in the assumption of a normal distribution when the conditions of the Central Limit Theorem are met, characterizing the denominator is not a problem, and traditional statistical tools are very effective. The Denominator Problem, however, occurs when those assumptions cannot be made. I assert that the Denominator Problem in biomedicine arises from a lack of recognizing the full consequences of biological heterogeneity as manifestation of dynamic multi-scale biological processes. The generation of multi-scale biological heterogeneity leads to a vast representational gulf between the empirically collected world of observables (data) and the inferences made regarding the generative processes and structures that produce those observations; this gulf reflects an issue of sparsity arising from limitations in how biological data is produced/collected. A schematic of this problem is depicted in Fig. 1, which shows the relationship between the set of biological possibility (e.g., the denominator of the system),

*, versus both an inferred structure of that space based on either empirical sampling*

**A***[i.e., clinical data sets versus the range of possible physiological states, for an example within a research domain see (Button et al. 2013)], and the range of particular possibilities investigated by experiment,*

**B***and*

**C1***(see Refs. Richter 2017; Richter et al. 2009).*

**C2**The limitations of biological investigation have both a legacy and logistical component [for extensive discussion see Ref. (Vodovotz and An 2014)]. Historically, biology has been an almost completely empirical endeavor, with an emphasis on description at greater and greater levels of detail. Zoology and botany both manifest this property, where similarity and difference are categorized by increasingly subtle descriptions of various features or components of an organism. There is a belief that increasingly detailed information about a particular system equates to increased knowledge of the system, i.e., a quest for *microstate* characterization. The emphasis on various—omics technologies to classify biological states or entities today is a direct result of this paradigm. However, microstate characterization, when used to determine a baseline probability distribution of the denominator of r thusly described (e.g., microstate-based) phenotype space cannot be assumed to have a normal distribution since the variables across the system are no independent and thus violate the conditions of the Central Limit Theorem. Note that this assertion only related to the denominator distribution if viewed as a list of microstate variables, as in the case in various—omics profiles or attempts at biomarker panel discovery. This is not the case for the population of higher-order phenotypes/observables that incorporate and manifest the multi-scale transition from microstate to macrostate (i.e., the normal distribution of human height, for instance). However, with the desire to increase the granularity of state description to the—omics level, there is a loss in the ability to infer the shape of the denominator distribution for the reason described above. Defining the shape of the denominator space through real-world population sampling is further compromised by the sheer logistical challenges of acquiring “good” empirical data: the need for quality control of the collected data produces a paradox in that with better precision of characterization of the data (e.g., “high quality”) the smaller that data set becomes, further degrading the general representative capability of that data; as such Region * B* in Fig. 1 shrinks further compared to

*(Button et al. 2013).*

**A**At this point it is critical to recognize the importance and role of biology’s foundational theory, *evolution*, its effect on the structure it produces, and the ability to characterize them. Biology is inherently multi-scale, with organizational levels that encompass highly diverse yet somewhat redundant processes. The leads to the case where there is virtually no path-uniqueness between microstates and higher-order phenotypes/observable; in fact, evolution could not work otherwise. This “bowtie” architecture (Csete and Doyle 2004; Doyle and Csete 2011) produces a modular organization for biology that allows the persistence of a diversity of potential in the face of natural selection optimizing on phenotypic fitness. The lack of path-uniqueness in how generative microstates produce phenotypic macrostates means that traditional reductionist methods of system characterization insufficiently capture the true diversity (and therefore the effective denominator space in terms of microstate) of a biological system: what is identified through classical reductionism is only a sliver of plausible solutions/outcomes. This dynamic is reflected in Fig. 1, where * C1* and

*are intentionally draw narrower than*

**C2***or*

**A***to illustrate an intrinsic inability of reductionist experimental biology to effectively reflect system denominator space. There are inherent constraints imposed by what constitutes a “good experiment” on the ability of those experiments to approach representing A, namely the need to produce highly controlled conditions that limit variability in order to strengthen the statistical power of experimental results. In other words, these studies explicitly limit the range of possible phenotypes represented with the experimental system for the sake of a more definitive conclusion arrived at through the specific experimental preparation. While this paradigm is essential for experimental biology, but its very nature it produces the preconditions of the Crisis of Reproducibility: experimental results may be statistically valid with respect to the very specific context in which they were generated, but this increased precision comes at the expense of increased sensitivity to subtly different conditions that may lead to disparate results (Richter 2017). A practical manifestation of this phenomenon is seen in the extreme sensitivity to initial conditions in experimental biology, and is illustrated by the following situation familiar to anyone who has run a basic science research laboratory: a previously reliable and reproducible experimental model needs to be nearly completely recalibrated when either a new laboratory tech/post-doc/grad-student starts, or when the bedding/reagent supplier changes, or even the air or water filters for the laboratory are changed. In fact, this sensitivity to initial/experimental conditions is exactly how the Crisis of Reproducibility came to be recognized, as meticulously followed experimental procedures intended to replicate prior results were unable to do so (Begley and Ellis 2012). The limitations of experimental biology are further accentuated by the fact that the knowledge extracted from these experiments is focused on component-based description, e.g., state identification. Such an approach necessarily results in systems that are viewed, at best, as a series of static snapshots of their component configuration (i.e., metabolic state, gene/mRNA/protein expression levels, receptor levels/types, histological features, biomarker panels, etc.), but without any explicitly described process linking these various snapshots.*

**B***, where there is virtually no limit to the sample size acquired, while encompassing and expanding on the representational capacity of Letters*

**B***and*

**C1***, to produce a denser and more expansive baseline population distribution (e.g., the denominator space) of the system being studied (see Fig. 3). The distribution of the denominator space is defined by the sum of the trajectories generated from the parameter space of the model, which incorporates stochasticity (both epistemic and aleatory). Thus, observed biological heterogeneity is a manifestation of the parameter space of an underlying MSM, and advances in computational power allow the scale of simulation experiments needed to more fully characterize the denominator of any biological system being studied.*

**C2**As an example, we have applied these concepts in a preliminary fashion to the problem of sepsis (Cockrell and An 2017). We used a previously validated agent-based model (ABM) that simulates the innate immune response as a proxy system to characterize the denominator space of that system’s response dynamics to infectious insults. We utilized high-performance computing to perform extensive parameter space characterization of the ABM by simulating over 66 million simulated patient trajectories, each trajectory made up of up to 216,000 time points representing 90 days of simulated time, and in so doing were able to define a region of parameter space consistent with plausible biological behavior, essentially creating a 1st approximation of the “true” denominator space of sepsis. As a comparison, there have been roughly 30,000 patients enrolled in all the reported clinical trials for sepsis (Buchman et al. 2016), data that exists in general at a few time points and a limited set of observables. Furthermore, the generated state space for sepsis demonstrated a complex, multi-dimensional topology requiring the development of a novel metric for characterization [Probabilistic Basins of Attraction or PBoA (Cockrell and An 2017)], and definitely not consisting of a normal distribution, or one that could have reasonably been inferred a priori. Further analysis of this near comprehensive behavior space characterization demonstrated that any attempt to develop a state-based sampling strategy that would be predictive of system outcome, the motivation behind biomarker discovery, was futile (Cockrell and An 2017). This suggests that attempts for reliable forecasting sepsis trajectories would have to adopt some means of model-based, non-linear classifier. Ongoing work in this area is aimed at utilizing advanced learning and optimization techniques to discover both forecasting classifiers and multi-modal control strategies (Cockrell and An 2018; Petersen et al. 2018).

- 1.
The shape of the denominator space for a microstate characterization for a complex biological system cannot be assumed.

- 2.
Insufficient sampling of that denominator space will lead to an inability to generalize knowledge generated from that sampling, i.e., an inability to either translate knowledge from one context to another, or reproduce one experiment in another context.

- 3.
The paradigm/requirements of experimental biology serve to constrain the sampling of denominator space. This same paradigm leads to extreme sensitivity to conditions and irreproducibility.

- 4.
The logistical barriers of clinical research limit the density/granularity of representation and characterization denominator space (e.g., sparsity).

- 5.
Multi-scale models can and need to be used as proxy systems that can both bind together experimental knowledge (e.g., link denominator spaces generated by reductionist experiments to define the shape of the overall space) and “fill in the gaps” of clinical data (e.g., increase the density coverage of the denominator space).

There is, however, a significant cautionary note with respect to the use of MSMs for this unifying purpose. Specifically, an emphasis on precise and detailed prediction as a means of judging the adequacy and validity of MSMs generates the same limitations that afflict experimental biology: trying to enhance precision by reducing or eliminating output variability or noise functions to restrict the denominator space represented by the MSM and reduces its generalizing capability. This phenomenon most often manifests with over-fit and tightly parameterized, brittle models; therefore, overcoming this trap requires employing the concept of using widely bounded parameter spaces as part of the description of a MSM, and utilizing experimental/clinical data that incorporate outliers that reflect a sparse sampling of the wider behavioral denominator space.

In this fashion, MSMs serve a critical fundamental role in the scientific process: they are able to generate the data seen in different contexts (be it in different experiments or clinical situations) using the same functional structure through either expansion of parameter space or stochastic processes. As such, these models function as (t)heories that encapsulate what can be considered “similar” between ostensibly different biological systems or individuals, thereby obviating the “crisis” of obtaining robust, scalable and generalizable knowledge in biomedical research. With a wider adoption of this approach to using MSMs, it is hoped that perhaps new, more powerful mathematical formalisms will be identified that are able to more effectively depict biological systems in all their richness. In the meantime, however, I suggest that the use of MSMs as described above can bring a much needed level of formalism to determining what is and is not similar between different biological systems, be it across the translational divide (to address the Valley of Death) or between individuals for “true” Precision Medicine, which should mean the right drugs in the right combinations for the right patients at the right time. Having trustworthy formal functional representations that can effectively capture biological heterogeneity is a necessary step in being able to apply true engineering principles to identifying strategies to control pathophysiological processes back to a state of health.

## Notes

### Acknowledgements

This work is supported by funding from the National Institutes of Health/National Institute of General Medical Sciences Grant 1S10OD018495-01.

## References

- Bailoo JD, Reichlin TS, Würbel H (2014) Refinement of experimental design and conduct in laboratory animal research. ILAR J 55(3):383–391CrossRefGoogle Scholar
- Baker M (2016) 1,500 scientists lift the lid on reproducibility. Nat News 533(7604):452CrossRefGoogle Scholar
- Begley CG, Ellis LM (2012) Raise standards for preclinical cancer research. Nature 483:531CrossRefGoogle Scholar
- Buchman TG et al (2016) Precision Medicine for Critical Illness and Injury. Crit Care Med 44(9):1635–1638CrossRefGoogle Scholar
- Butler D (2008) Translational research: crossing the valley of death. Nature News 453(7197):840–842CrossRefGoogle Scholar
- Button KS et al (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci 14(5):365CrossRefGoogle Scholar
- Cockrell C, An G (2017) Sepsis reconsidered: identifying novel metrics for behavioral landscape characterization with a high-performance computing implementation of an agent-based model. J Theor Biol 430:157–168CrossRefGoogle Scholar
- Cockrell RC, An G (2018) Examining the controllability of sepsis using genetic algorithms on an agent-based model of systemic inflammation. PLoS Comput Biol 14(2):e1005876CrossRefGoogle Scholar
- Collins FS, Varmus H (2015) A new initiative on precision medicine. N Engl J Med 372(9):793–795CrossRefGoogle Scholar
- Csete M, Doyle J (2004) Bow ties, metabolism and disease. Trends Biotechnol 22(9):446–450CrossRefGoogle Scholar
- Doyle JC, Csete M (2011) Architecture, constraints, and behavior. Proc Natl Acad Sci 108(Supplement 3):15624–15630CrossRefGoogle Scholar
- Drucker DJ (2016) Never waste a good crisis: confronting reproducibility in translational research. Cell Metab 24(3):348–360CrossRefGoogle Scholar
- Fanelli D (2018) Opinion: Is science really facing a reproducibility crisis, and do we need it to? Proc Natl Acad Sci 115(11):2628–2631CrossRefGoogle Scholar
- Freedman LP, Cockburn IM, Simcoe TS (2015) The economics of reproducibility in preclinical research. PLoS Biol 13(6):e1002165CrossRefGoogle Scholar
- Frye SV et al (2015) Tackling reproducibility in academic preclinical drug discovery. Nat Rev Drug Discov 14(11):733CrossRefGoogle Scholar
- Ioannidis JP (2005) Why most published research findings are false. PLoS Med 2(8):e124CrossRefGoogle Scholar
- Jarvis MF, Williams M (2016) Irreproducibility in preclinical biomedical research: perceptions, uncertainties, and knowledge gaps. Trends Pharmacol Sci 37(4):290–302CrossRefGoogle Scholar
- Kaiser J (2017, November 18) Rigorous replication effort succeeds for just two of five cancer papers. Retrieved from http://www.sciencemag.org/news/2017/01/rigorous-replication-effort-succeeds-just-two-five-cancer-papers
- Marquart J, Chen EY, Prasad V (2018) Estimation of the percentage of us patients with cancer who benefit from genome-driven oncology. JAMA Oncology 4(8):1093–1098CrossRefGoogle Scholar
- Peng R (2015) The reproducibility crisis in science: a statistical counterattack. Significance 12(3):30–32CrossRefGoogle Scholar
- Petersen BK et al (2018) Precision Medicine as a control problem: Using simulation and deep reinforcement learning to discover adaptive, personalized multi-cytokine therapy for sepsis. arXiv, 2018. arXiv:1802.10440 [cs.LG]
- Prinz F, Schlange T, Asadullah K (2011) Believe it or not: how much can we rely on published data on potential drug targets? Nat Rev Drug Discovery 10(9):712CrossRefGoogle Scholar
- Richter SH (2017) Systematic heterogenization for better reproducibility in animal experimentation. Lab Animal 46:343CrossRefGoogle Scholar
- Richter SH, Garner JP, Würbel H (2009) Environmental standardization: cure or cause of poor reproducibility in animal experiments? Nat Methods 6(4):257CrossRefGoogle Scholar
- Vodovotz Y, An G (2014) Translational systems biology: concepts and practice for the future of biomedical research. Elsevier, AmsterdamGoogle Scholar

## Copyright information

**Open Access**This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.