Why Probability isn’t Magic

“What data will show the truth?” is a fundamental question emerging early in any empirical investigation. From a statistical perspective, experimental design is the appropriate tool to address this question by ensuring control of the error rates of planned data analyses and of the ensuing decisions. From an epistemological standpoint, planned data analyses describe in mathematical and algorithmic terms a pre-specified mapping of observations into decisions. The value of exploratory data analyses is often less clear, resulting in confusion about which characteristics of design and analysis are necessary for decision making and which may be useful to inspire new questions. This point is addressed here by illustrating the Popper-Miller theorem in plain terms and with graphical support. Popper and Miller proved that probability estimates cannot generate hypotheses on behalf of investigators. Consistent with Popper-Miller, we show that probability estimation can only reduce uncertainty about the truth of a merely possible hypothesis. This fact clearly identifies exploratory analysis as one of the tools supporting a dynamic process of hypothesis generation and refinement which cannot be purely analytic. A clear understanding of these facts will enable stakeholders, mathematical modellers and data analysts to better engage on a level playing field when designing experiments and when interpreting the results of planned and exploratory data analyses.


Introduction
The interplay of induction and deduction in science has been intensely discussed over the last four centuries (Magnani et al. 1999). Bacon (1620) prominently stated the value of methodical observation against the cognitive biases of medieval scholasticism. Hume (1739) saw that no induction from any amount of past cases can logically ensure the conformity of future cases, while recognising that induction is a main driver of hypothesis generation. Peirce (1878) introduced the concept of abduction and the related case of inference to the best explanation (IBE), describing processes by which new hypotheses are generated by relating empirical observations to domain knowledge (Magnani 2017; Magnani 2001).
Lessons from this extensive debate have only partially percolated into current experimental science, resulting in a persistent lack of clarity about the specific roles of pre-planned and exploratory data analyses. Of note, data analyses investigating causal relationships (Pearl 2009) are not considered exploratory here due to the substantial theoretical assumptions that are typically involved. Exploratory analyses are routinely used for data dimensionality reduction and visualisation (Tukey 1977; Gelman 2004; Jebb et al. 2017), especially when many measurements are collected from relatively few samples, such as in clinical research (Biesecker 2013), forensics (Aitken and Taroni 2004) and environmental sciences (Reimann et al. 2008), among others. One motivating context here is the identification of predictors of sensitivity to cancer therapies. Development of cancer drugs over the last fifty years delivered radiotherapy, chemotherapy and cancer immunotherapies which improved survival and quality of life for many, albeit not yet all, cancer patients (Chabner and Roberts 2005; Rosenberg 2014). To date, most molecular predictors of cancer response have been established by clinical confirmation of hypotheses based on preclinical experiments and on exploratory analysis of clinical data (Perez-Gracia 2017; Wilson and Altman 2018; Barker et al. 2009; Berry 2012; Yarchoan et al. 2017; Goldberg et al. 2017). Predictive, personalized, preventive, participatory (P4) cancer medicine (Hood and Friend 2011) is an emerging paradigm underpinning the development of potentially more effective cancer treatments and prevention strategies. P4 calls for robust epistemological support to guide the interpretation of the massive data pipelines currently enabling integrative analysis of DNA, RNA, protein expression and epigenetic features at single cell resolution (Stuart and Satija 2019).
Specifically, it is well known that these analyses can generate false positive results even under stringent probabilistic constraints, fueling the debate on reproducibility in experimental sciences (Baker 2016; Johnson 2013; Wasserstein and Lazar 2016). This debate also harbours questions about whether probability is the best language to quantify evidence from data for decision making, as other algorithmic approaches seem to provide attractive alternatives (Breiman 2001; Langley 1995; Langley 2000).
Lack of clarity about the specific value of pre-planned and of exploratory data analyses may arise from a lack of awareness of the limitations of "large p, small N" studies (West 2003), or from broader misunderstandings of statistical inference. This issue is addressed here by illustrating the Popper-Miller theorem (Popper and Miller 1983; Rochefort-Maranda and Miller 2018) using plain language and graphical support. Popper and Miller proved that the estimation of probabilities per se cannot generate new hypotheses, thus clearly identifying pre-planned data analyses as the mathematical and algorithmic description of a hypothetico-deductive mapping of data into decisions. Popper also clearly identified exploratory data analysis as the analytic component of a process of hypothesis generation and refinement which cannot be entirely analytic, as it entails human ingenuity and creativity. Consistent with Popper-Miller, we show that statistical inference can only reduce an investigator's uncertainty about the truth of a merely possible hypothesis. A critical understanding of these facts based on one simple graph will enable a more effective engagement between statisticians and other stakeholders when planning experiments and when assessing how to act on the basis of exploratory or confirmatory data analysis results.

Popper-Miller in a Nutshell
Popper-Miller relies on distinguishing possibility from probability (Hájek 2001; O'Neill and Summers 2015). A possible hypothesis is any statement whose truth can be accepted or rejected based on objective measurements. A possible hypothesis also becomes probable when its likelihood of being true is quantified either as a sampling frequency or as a subjective degree of belief (Lindley 1971). Popper-Miller states that probability estimation alone cannot be hypothesis-generating because the possibility of a hypothesis is implied by the decision to estimate its truth probability, and not vice versa. Equivalently, measurements become data when related to a specific and pre-existing hypothesis. It follows that numerical representations of observations or experimental results cannot "speak by themselves" because their status as data is defined by their relation to a necessary context, established through an open-ended abductive process characteristic of human creativity (Magnani 2019).
Remarkably, Popper-Miller also applies to statistical inferences determining structural features of data analysis models, such as smoothing of time series (Murphy 2002), modelling mixture distributions (McLachlan and Peel 2000) or the identification of prognostic or predictive factors (Lee 2019). Popper-Miller applies here because estimation of specific dynamics, data clusters or associations implies the possibility of, and interest in, these estimates and not vice versa.
A corollary to Popper-Miller is that probability estimation is not a mechanism that by itself can inform an investigator on how to refine her current hypotheses, because this step entails a statement of new possibilities (Popper and Miller 1987). Popper synthesized this argument against "probability magic" stating that "whatever we may think of induction, it certainly is not analytic" (Popper 1957). This argument reflects common practice, where probability estimates are motivated by and supplemented with dynamic and contextual factors including subject-matter expertise, assessments of the potential consequences of decisions for individuals and organisations, and the preferences and risk attitudes of end-users, stakeholders, regulators and decision-makers (French and Rios-Insua 2010). Although these arguments have been extensively explored (Kuhn 1962; Maio 1998; Fuller 2003), the current debate on the role of algorithms as mechanisms for unbiased discovery (Anderson 2008; Calude and Longo 2017; Langley 2019; Coveney et al. 2016) calls for further clarification.

Popper-Miller in a Picture
Let $t_0$ mark the time when a statement H becomes a possible hypothesis for an investigator. Prior to $t_0$ no evidence about H is quantifiable by this investigator because she is unable to interrogate any measurement about what she has not yet conceived as possible. Here we do not describe the process of hypothesis generation occurring at time $t_0$, as any such attempt would rely on context-specific arguments and on an understanding of cognitive psychology well beyond the scope of this work. After $t_0$, evidence about H is quantifiable relative to the auxiliary assumptions adopted as a basis for collecting and analysing data, as shown by Duhem and Quine (Ariew 1984). Evidence may be sought about population frequencies, unobservable quantities or future data values (Rubin 1981). Typically, p-values are used to measure evidence against a single hypothesis, Neyman-Pearson testing is used to choose between two hypotheses (Lehmann and Romano 2005) and methods fulfilling the likelihood principle are used to quantify the truth probability of a hypothesis relative to a set of alternatives (Berger and Wolpert 1984; Royall 1997).
At $t_0$ a probability $p_0$ of H being true may or may not be included among the investigator's auxiliary assumptions, depending on whether H is thought of as merely possible or also probable prior to data collection. In practice, mere possibility of H is typical at early stages of investigation and estimates of its truth probability may be quantified at later stages. To reflect this practical distinction, Popper-Miller is illustrated here for merely possible and for probable hypotheses.

Popper-Miller for Merely Possible Hypotheses
Prior to data collection, the truth of a merely possible hypothesis H is a random variable taking the values "H is true" or "H is false" with unknown probabilities $p_0$ and $1 - p_0$ respectively. Figure 1 shows the variance $V_0$ and entropy $e_0$ (Shannon 1948) (see Appendix) of this binary random variable, which quantify the investigator's uncertainty about the statement "H is true". Variance and entropy are concave functions symmetric about $p_0 = 0.5$, where their maxima $V_0^{\max} = 0.25$ and $e_0^{\max} = 1$ are attained. In Duhem and Quine's terminology, $p_0 = 0.5$ is the weakest auxiliary assumption available to the investigator at $t_0$ because her uncertainty about the truth of her merely possible hypothesis H is maximised. Figure 1 shows that any estimate $p_t$ about the truth of H calculated from data observed at $t > t_0$ can only reduce the investigator's uncertainty relative to $p_0 = 0.5$, because $V_t \le V_0^{\max}$ and $e_t \le e_0^{\max}$ for any $p_t \in [0, 1]$. Equivalently, no estimate $p_t$ can increase the investigator's uncertainty about the truth of a hypothesis above her maximum uncertainty after deeming this hypothesis merely possible. For probability estimation to be hypothesis-generating, there ought to be at least one estimate $p_t$ increasing $V_t$ or $e_t$ beyond their respective maxima $V_0^{\max}$ and $e_0^{\max}$. However, no such estimate exists or, equivalently, "probabilistic support is not inductive" (Popper and Miller 1987).

Fig. 1 Variance and entropy of the binary random variable "H is true" for a possible hypothesis H, plotted against the unknown probability that the hypothesis is true prior to data collection, $p_0$. Maximum uncertainty is attained at $p_0 = 0.5$. Any departure from $p_0 = 0.5$ at $t > t_0$ can only decrease the investigator's uncertainty about the truth of H, showing that probabilistic support cannot be per se hypothesis generating.
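
The bound described above can be checked numerically. The following is a minimal sketch, not the author's R code for Figure 1: it assumes the standard Bernoulli variance $p(1 - p)$ and the Shannon entropy in bits, and the function names and the evaluation grid are illustrative choices.

```python
import math


def bernoulli_variance(p):
    """Variance of the binary variable "H is true" with truth probability p."""
    return p * (1.0 - p)


def bernoulli_entropy(p):
    """Shannon entropy in bits; 0 * log2(0) is taken as 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


# Illustrative grid of candidate truth probabilities in [0, 1].
grid = [i / 100 for i in range(101)]

# Maximum uncertainty, attained under the weakest assumption p = 0.5.
v_max = bernoulli_variance(0.5)
e_max = bernoulli_entropy(0.5)

# No value of p on the grid yields more uncertainty than p = 0.5.
no_estimate_exceeds_max = all(
    bernoulli_variance(p) <= v_max and bernoulli_entropy(p) <= e_max
    for p in grid
)

print(v_max, e_max, no_estimate_exceeds_max)
```

On this grid both measures peak at 0.5 and are symmetric about it, which is the property the argument relies on.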

Popper-Miller for Probable Hypotheses
When the investigator is willing to quantify $p_0$, elicitation is used to map her beliefs into coherent probability statements (Garthwaite et al. 2005; O'Hagan et al. 2006). If elicitation is successful, a possible hypothesis becomes probable and the elicited value of $p_0$ quantifies the investigator's expectation about $p_t$ prior to collecting data. Here $p_0$ can be thought of as the proportion of exchangeable experiments within the design space expected to show that the possible hypothesis H is true. Given $p_0$, the investigator may conduct her study and estimate $p_t$ using Bayes' theorem (see Appendix) (Robert 2007; Bernardo and Smith 2000; O'Hagan and Forster 2004). When elicitation yields $p_0 = 0.5$, Figure 1 shows that the investigator's uncertainty about the truth of H cannot be increased by any Bayesian estimate $p_t$. When $p_0 \neq 0.5$ it is possible that $V_0 \le V_t \le V_0^{\max}$ and $e_0 \le e_t \le e_0^{\max}$ due to $p_t$ being closer to 0.5 than $p_0$, manifesting prior-data conflict (Evans and Moshonov 2006). Also in this case, no Bayesian estimate can increase the investigator's uncertainty about the truth of H above its maximum, which is attained at $p_t = 0.5$ and represents her belief in the mere possibility of H. Since no Bayesian estimate can have any bearing on the possibility of H no matter how different $p_0$ and $p_t$ might be, Figure 1 shows that Bayesian probability estimation cannot be a stand-alone algorithm for hypothesis generation (Gelman and Shalizi 2013).
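
The prior-data conflict case can be illustrated with a short numerical sketch. It assumes Bayes' theorem in odds form (posterior odds equal prior odds times the likelihood ratio) and uses hypothetical values for the elicited prior and the likelihood ratio; neither the numbers nor the function names come from the original study.

```python
import math


def posterior_probability(p0, likelihood_ratio):
    """Bayes' theorem in odds form; requires 0 < p0 < 1.

    likelihood_ratio = P(data | H is true) / P(data | H is false).
    """
    prior_odds = p0 / (1.0 - p0)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


def entropy_bits(p):
    """Shannon entropy in bits of the binary variable "H is true"."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


p0 = 0.9  # hypothetical elicited prior: H is thought likely to be true
lr = 0.2  # hypothetical likelihood ratio: the data favour "H is false"

pt = posterior_probability(p0, lr)

# The posterior moves towards 0.5, so uncertainty rises relative to the
# prior, yet it stays below the maximum of 1 bit attained at 0.5.
print(round(pt, 3), round(entropy_bits(p0), 3), round(entropy_bits(pt), 3))
```

With these assumed inputs the posterior lies between 0.5 and the prior, so entropy increases from its prior value while remaining below one bit, in line with the inequalities stated above.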

Discussion
A simple graphical tool is provided to show why probability estimation cannot be hypothesis generating per se, which is the essence of the Popper-Miller theorem. A critical understanding of this fact can further motivate scientists to keep "experimenting with experiments", taking full advantage of long-standing and recent results in experimental design (Fisher 1935; Cox and Reid 2000; Steinberg 2014). Even more importantly, decision-makers can use this graphical tool to show that elements beyond retrospective analysis of any complexity are necessarily involved in hypothesis generation, and to require that these elements be transparently described when evaluating any validation or extrapolation strategy. A waning awareness of Popper-Miller, especially when exploring high-dimensional data using innovative algorithms, may result in weak decision making and in misperception of the role of probability estimation and data analysis in empirical research. Specifically, shared ownership of hypothesis generation between data scientists and subject matter experts cannot confer confirmatory value to exploratory analysis.
A common avenue to maintain and promote awareness about the specific values of exploratory and confirmatory analyses is to engage investigators and decision makers in the determination of the error rates of decisions to be informed by the analysis results. For instance, elicitation of loss function components from decision makers naturally leads to a prospective definition of what a data analysis ought to deliver (e.g. March and Shapira 1987; Smith 2010). However, this engagement requires a commitment to decision rules that are seldom specified in sufficient detail. Many investigators will then seek clarity about what probabilistic support can be afforded in more common conditions. These can be broadly classified into studies where data sampling and analysis strategies are dynamically updated while the main objectives remain unchanged, and studies whose objectives are changed during their conduct. In the first scenario, methods for sample size re-estimation and for sequential analysis can be used to protect decisions from foreseeable errors (Bothwell et al. 2018; Chuang-Stein et al. 2006; Esserman et al. 2018; Le Tourneau et al. 2009). In the second scenario, the original data sampling design may be inappropriate to inform the new objectives. Exploratory analysis of data generated thus far may show which data sources may inform the new hypotheses to be investigated in the remainder of the study. In both scenarios, Popper-Miller shows that hypotheses are first generated in the mind of investigators and then tested through the mechanics of data analysis algorithms or that, equivalently, "we need both exploratory and confirmatory" analysis (Tukey 1980).

Conflict of interest
The author declares no conflicts of interest.
Appendix
$$\frac{p_t}{1 - p_t} = \frac{p_0}{1 - p_0} \, \frac{P(\text{data} \mid H \text{ is true})}{P(\text{data} \mid H \text{ is false})} \tag{4}$$

Code availability
Figure 1 was generated using R (https://cran.r-project.org/). The code is available upon request to the Author.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.