# For and Against Methodologies: Some Perspectives on Recent Causal and Statistical Inference Debates

## Abstract

I present an overview of two methods controversies that are central to analysis and inference: That surrounding causal modeling as reflected in the “causal inference” movement, and that surrounding null bias in statistical methods as applied to causal questions. Human factors have expanded what might otherwise have been narrow technical discussions into broad philosophical debates. There seem to be misconceptions about the requirements and capabilities of formal methods, especially in notions that certain assumptions or models (such as potential-outcome models) are necessary or sufficient for valid inference. I argue that, once these misconceptions are removed, most elements of the opposing views can be reconciled. The chief problem of causal inference then becomes one of how to teach sound use of formal methods (such as causal modeling, statistical inference, and sensitivity analysis), and how to apply them without generating the overconfidence and misinterpretations that have ruined so many statistical practices.

### Key Words

Bias; Causal inference; Causation; Counterfactuals; Potential outcomes; Effect estimation; Hypothesis testing; Intervention analysis; Modeling; Significance testing; Research synthesis; Statistical inference

## Introduction

*An Anarchic Preface*

The present paper is a commentary on recent controversies surrounding methodologies for causal and statistical inference. My title pays homage to *Against Method* (Feyerabend 1975), the magnum opus of the philosopher of science Paul Feyerabend (1924-1994). When I took his class in the early 1970s, the book was in manuscript form and he taught around its theme, which I understood to be that scientists should not allow their inquiries to be straightjacketed by whatever conceptualizations, theories, and methods were being enforced as “correct” by current authorities. This theme, or perhaps more the way he presented it, earned him the label of “anarchist” – one which he embraced, for he believed in theatrical attack on the rigidity of traditions and on claims that any aspect of methodology was *the* (singular) “scientific method” (Feyerabend 1995).

The message I took away was: Every methodology has its limits, so do not fall into the trap of believing that a given methodology will be necessary or appropriate for every application; conversely, do not reject out of hand any methodology because it is flawed or limited, for a methodology may perform adequately for some purposes despite its flaws. But such constrained anarchism (or liberal pluralism, if you prefer) raises the problem of how to choose from among our ever-expanding methodologic toolkit, how to synthesize the methods and viewpoints we do choose, and how to get beyond automated methods and authoritative judgments in our final synthesis. Possible solutions may be suggested by history and speculation about the psychosocial as well as logical factors that fuel resistance to and acceptance of theories and methodologies.

In keeping with that view, what follows are anecdotal musings on the essential tension between experience-based intuition on the one hand, and formal technologies on the other. Several citations provide further treatments of these topics in causal inference (e.g., Greenland 2012a; Porta et al. 2015; Morabia 2015; Porta and Bolúmar 2016). I focus on controversies about causal modeling and the ‘causal inference’ movement, and the problems of null bias and overconfidence that plague statistical methodology. I conclude that there are valid concerns about formal modeling and the artificially precise inferences it generates, especially when these formal statistical inferences are treated as sound judgments about causality. But these concerns do not detract from the value of causal modeling as an *aid* to inference, providing precise (but not absolute) perspectives on causal questions. Causal modeling is thus seen as a useful technology emerging from formal methodology, rather than a theory or philosophy of inference (which it is not). This perspectival view should redirect the focus of debates to the many problems that need to be addressed as new methodologies are integrated into instruction and practice.

*My view of methodology*

Similarly to other writers (e.g., Gigerenzer and Marewski 2015; Krieger and Davey Smith 2016; Vandenbroucke et al. 2016; VanderWeele 2016a, b), I hold that soft sciences (like epidemiology and sociology – sciences that cannot expect to discover numerically precise and general contextual laws analogous to those in physics) need to incorporate a broad range of methodologic perspectives and tools, especially when competing approaches provide checks on each other.^{1} Not just any methodology can be admitted to this toolkit, however: Each needs to provide its own explicit justification to aid valid application and criticism. Meeting this need to the satisfaction of scientists is what makes a method ‘scientific’.

Causal-inference methodology focuses on how to explain and act on observed associations. Modern causal modeling (“causal inference”) theory offers many tools for avoiding logical fallacies in explanations; but some argue that it overemphasizes mathematics and deduction at the cost of the study description and contextual synthesis essential to reliable real-world inference (Krieger and Davey Smith 2016; Vandenbroucke et al. 2016; Broadbent et al. 2016; see also Freedman 1987). In a complementary fashion, traditional statistical modeling offers tools for estimating and summarizing the associations to be explained, and thus is integral to the causal-inference process; but arguably it too has overemphasized mathematical derivations at the expense of proper interpretation (Greenland et al. 2016). Compounding these problems is that traditional statistics fails to clearly separate the causal (or explanatory) and associational (or passively predictive) aspects of inference.

This melding of causal and associational inference is understandable: The theory of randomized designs led to the explosion of statistics in the mid-20^{th} century and still dominates statistical training. In this theory, random selection and allocation allow us to equate certain observable associations with targeted effects (Greenland 1990), so that causal inference can focus on purely statistical deductions such as “if the no-effect model is correct, we would rarely see an association as or more extreme than this”.
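The deduction quoted above can be illustrated with an exact randomization test. The following is a minimal sketch, assuming an invented eight-unit trial with four units allocated to each arm; the outcome values, the function name `randomization_p_value`, and the choice of a difference in means as the test statistic are all hypothetical choices made for illustration.

```python
from itertools import combinations

# Hypothetical outcomes from a small randomized trial (invented for illustration).
treated = [7.1, 6.4, 8.0, 7.5]
control = [5.9, 6.1, 6.6, 5.2]


def randomization_p_value(treated, control):
    """Exact two-sided randomization p-value for a difference in means.

    Under the sharp no-effect model, each unit's outcome is the same whatever
    its allocation, so every split of the pooled outcomes into groups of the
    observed sizes was equally likely under random allocation.
    """
    pooled = treated + control
    n_treated = len(treated)
    n_control = len(pooled) - n_treated
    total = sum(pooled)

    def mean_difference(indices):
        treated_sum = sum(pooled[i] for i in indices)
        return treated_sum / n_treated - (total - treated_sum) / n_control

    observed = abs(mean_difference(range(n_treated)))
    allocations = list(combinations(range(len(pooled)), n_treated))
    # Small tolerance guards against floating-point ties with the observed value.
    as_extreme = sum(
        1 for idx in allocations if abs(mean_difference(idx)) >= observed - 1e-12
    )
    return as_extreme / len(allocations)


p_value = randomization_p_value(treated, control)
print(f"Proportion of allocations as or more extreme: {p_value:.4f}")
```

The proportion printed is valid as a causal statement only because allocation is assumed to have been truly random; nothing in the calculation checks that assumption.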

Causal modeling attempts to maintain this deductive focus within imperfect research by deriving models for observed associations from more elaborate causal (‘structural’) models with randomized inputs (the current “broken experiment” paradigm). But in the world of risk assessment – a world in which epidemiology is a major but not the only source of information – the causal-inference process cannot rely solely on deductions from models or other purely algorithmic approaches. Instead, when randomization is doubtful or simply false (as in typical applications), an honest analysis must consider sources of variation from uncontrolled causes with unknown, nonrandom interdependencies. Causal identification then requires nonstatistical information in addition to information encoded as data or their probability distributions (Robins and Wasserman 1999; Broadbent et al. 2016).

This need raises the question of the extent to which inference can be codified or automated (which is to say, formalized) in ways that do more good than harm. In this setting, formal models – whether labeled “causal” or “statistical” – serve a crucial but limited role in providing hypothetical scenarios that establish what would be the case if the assumptions made were true and the input data were both trustworthy and the only data available. Those input assumptions include all the model features and prior distributions used in the scenario, and supposedly encode all information being used beyond the raw data file (including information about the embedding context as well as the study design and execution).

… our present failing is an over-confidence in modern technique and a lack of appreciation of the value of that wisdom which can only be obtained by personal observation and experience. I would urge the younger generation to think more and observe more for themselves, as their forefathers did, and not be so ready to bow the knee in a fanatical worship of so-called scientific methods of investigation and treatment.

## Resistance to formal methodology

Proposals to expand quantitative methods have long met with resistance. Nearly a century ago Major Greenwood (1924) felt the need to write an essay in Lancet entitled “Is the statistical method of any value in medical research?” This essay was written *before* the Fisher/Neyman-Pearson revolution that overthrew the Bayesian (“inverse probability”) significance testing used by Greenwood and his predecessors (Student 1908). Greenwood modestly sought to defend statistical reasoning based on tabulations of existing nonexperimental data, and closed with a prescient warning against inference dominated by statisticians. But Greenwood was already ancient history when I entered the field, and I was unprepared for the ferocity of opposition to formal methods outside the narrow scope of conventional statistics.

In 1980, Stallones lamented “a continuing concern for methods, and especially the dissection of risk assessment, that would do credit to a Talmudic scholar and that threatens at times to bury all that is good and beautiful in epidemiology under an avalanche of mathematical trivia and neologisms.”

The same year, the eminent biostatistician Samuel Greenhouse (1918-2000) (1980, p. 269) wrote:

A great deal of excellent statistical talent is being devoted to an inordinate concentration on procedures and concepts which are terribly overworked. How many more papers on matching, on covariates, on synergism and antagonism can we tolerate?

It is easy to dismiss such complaints as the usual power defense of established authorities, which often boil down to “I’ve been successful without all these intricacies, therefore don’t you bother with them either”. But the above quotes are from accomplished sources, so we might seek other, more legitimate reasons for their negative reactions.

Methodologists (including myself) can at times exhibit poor judgment about which of their new discoveries, distinctions, and methods are of practical importance, and which are charitably described as ‘academic’ (such as 3^{rd}-digit errors in relative-risk estimates). Weighing the costs and benefits of proposed formalizations is crucial for allocating scarce resources for research and teaching, and can depend heavily on the application (Greenland 2012a). Field researchers prefer informal and thus more easily communicated contextually based descriptions, even if those are more prone to ambiguity and logical error than are formal deductions. A problem however is that contextual idiosyncrasies can lead to blindness about when those descriptions mislead.

An example from the time Stallones and Greenhouse wrote was the then-emergent division among risk and ‘relative risk’ measures into risks, rates, odds, and their ratios (with even finer distinctions possible, especially when one considered sampling designs). At the time, outcomes under most intense research (especially cancers) were uncommon enough over typical study periods so that the differences among the ratio measures were not impressive. But the resulting dismissal of the distinctions betrayed a narrow vision blind to topics in which a “rare-disease” assumption could become unsupportable, as in injury and clinical epidemiology – where on the order of half of a cohort might experience the study outcome.
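The divergence among the ratio measures can be seen in a short calculation. The sketch below assumes a closed cohort with constant hazards over one unit of follow-up and no competing risks, so that risk = 1 − exp(−rate); the baseline risks (1% and 50%) and the rate ratio of 2 are invented inputs, and `ratio_measures` is a hypothetical helper name.

```python
import math


def ratio_measures(baseline_risk, rate_ratio):
    """Risk, rate, and odds ratios in a closed cohort with constant hazards.

    Illustrative assumptions: exponential survival over one unit of follow-up
    time and no competing risks, so risk = 1 - exp(-rate).
    """
    rate_unexposed = -math.log(1.0 - baseline_risk)
    rate_exposed = rate_ratio * rate_unexposed
    risk_exposed = 1.0 - math.exp(-rate_exposed)
    risk_ratio = risk_exposed / baseline_risk
    odds_ratio = (risk_exposed / (1.0 - risk_exposed)) / (
        baseline_risk / (1.0 - baseline_risk)
    )
    return {
        "rate_ratio": rate_ratio,
        "risk_ratio": risk_ratio,
        "odds_ratio": odds_ratio,
    }


# Uncommon outcome (1% baseline risk) versus common outcome (50% baseline risk).
rare = ratio_measures(baseline_risk=0.01, rate_ratio=2.0)
common = ratio_measures(baseline_risk=0.50, rate_ratio=2.0)

for label, result in (("rare", rare), ("common", common)):
    print(
        f"{label:>6}: rate ratio {result['rate_ratio']:.2f}, "
        f"risk ratio {result['risk_ratio']:.2f}, "
        f"odds ratio {result['odds_ratio']:.2f}"
    )
```

With the uncommon outcome the three ratios nearly coincide, while with half the unexposed cohort experiencing the outcome the same rate ratio corresponds to a risk ratio well below it and an odds ratio well above it.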

The methodology movement went well beyond mere technicalities to seek general principles that could guide or replace the wooly expert judgment. The need was clear: While experts could supply strong opinions about the causal status of associations, their judgments often incorporated implicit, questionable assumptions and logical fallacies, and were especially prone to overconfidence and other cognitive biases (Lash 2007). By using concepts of causation and bias precise enough to reveal and address these problems, rigorous logic could become an integral part of causal inference, alongside (or for the more radical, in place of) informal treatments supplemented by traditional statistics.

Much benefit can accrue from thinking a problem through within these models, as long as the formal logic is recognized as an allegory for a largely unknown reality. A tragedy of statistical theory is that it pretends as if mathematical solutions are not only sufficient but “optimal” for dealing with analysis problems when the claimed optimality is itself deduced from dubious assumptions. Such pretensions have been taken to task by eminent statisticians like Tukey (1962) and Box (1990) in the context of engineering. Likewise, we should recognize that mathematical sophistication seems to imbue no special facility for causal inference in the soft sciences, as witnessed for example by Fisher’s attacks on the smoking-lung cancer link (Susser 1977; Stolley 1991; Vandenbroucke 2009).

It is perhaps ironic, then, that such attacks (however wrongheaded) motivated an entirely new set of inferential tools to address uncertainties beyond randomness (Porta et al. 2015; Morabia 2015), stimulating not only the Hill considerations (Hill 1965) but also early models for confounding (Cornfield et al. 1959; Bross 1967) and multifactorial causation (MacMahon and Pugh 1967; Rothman 1976). Thus one could say that attacks by theorists did a service in stimulating formal analyses of causation and bias, leading to the methodology under debate today – which has in turn generated criticism from those unenamored by formal methods.

## Resistance to potential-outcome models

Although causal models extend back nearly a century (Wright 1921; Neyman 1923; Welch 1937; Wilk 1955), extensions of potential-outcome (counterfactual) models to nonexperiments arose in parallel with modern epidemiology (Simon and Rescher 1966; Rubin 1974). These extensions soon led to analytic innovations such as propensity-score and longitudinal models for causal analysis (Rosenbaum and Rubin 1983; Robins 1987). But they also led to fierce opposition to the potential-outcome framework, e.g., see Newman and Browner (1988) vs. Greenland (1988); Dawid vs. his discussants (2000); and Shafer (2002) vs. Maldonado and Greenland (2002). This resistance came not only from some traditional epidemiologists, but also from those statisticians who viewed their ‘classical’ methods (derived from experimental statistics established only in the 1930s) as necessary and sufficient for statistical analyses of causation, even when applied in purely observational settings.

One unforgettable experience involved a 1992 paper by Robins and Greenland (1992) appearing in *Epidemiology*, which has been cited widely as describing conditions for unconfounded mediation analysis under potential-outcome models (e.g., see VanderWeele (2015) and its citations). This paper was initially submitted to the *International Journal of Epidemiology* (IJE) and rejected firmly by its statistician reviewer, who dismissed counterfactual models as “dangerous” and “misleading”. Ironically, we submitted to the IJE first because it had accepted with ease our earlier paper (Greenland and Robins 1986) laying out conditions for unconfounded estimation of basic causal effects under a potential-outcome model.^{2}

Since then, such models have come to dominate the literature on causal-inference methods. Their triumph in this regard has generated a new wave of concerns and criticisms (e.g., King and Zeng 2007; Glymour and Glymour 2014; Vandenbroucke et al. 2016; Krieger and Davey Smith 2016; Schwartz et al. 2016, 2017), a few of which I discuss below. Aggravating concerns is that the causal modeling movement adopted “causal inference” as its banner, betraying pretensions to comprehensiveness despite the fact that its methods are chiefly limited to analysis of single studies, rather than the evidence synthesis required for practical causal inference (Greenland 2012a). Nonetheless, it has been pointed out that some current criticisms of causal modeling reflect general problems in any rational account of causal inference, and that other criticisms are misunderstandings of the technical details and utility of the models (Daniel et al. 2016; Hernán 2016; Robins and Weissman 2016; VanderWeele 2016a, b; VanderWeele et al. 2016). I review a few of these issues below.

*Objections based on the randomized-trial standard*

It has been said that causal models are dangerously narrow because they identify effects only under what may be viewed as experimental designs (albeit arbitrarily complex ones), raising fears of taking randomized trials as gold standards of evidence. Randomized trials are indeed a problematic gold standard in practice (Broadbent et al. 2016; Maldonado 2016; Mansournia et al. 2017): Not only do they typically suffer from internal problems and often severe imprecision, but also, due to eligibility restrictions, tend to have more severe generalizability (transportability) issues than do passive observational studies.

So-called “natural experiments” usually involve fewer eligibility restrictions at the expense of less reliably random exposure. But there is no sharp boundary between natural experiments and nonexperiments; a natural experiment is just an observational study in which no one disputes that a particular cause of exposure appears to be randomly distributed (Dunning 2008). This reality makes “natural experiment” a group judgment based on shared assumptions, where the latter are summarized in a causal model in which the effect is identified.

Then again, there is no sharp boundary between natural and artificial experiments: The label “randomized experiment” is also a group judgment based on a shared causal model – a model which assumes (among other things) that the trial followed all protocols necessary to preserve randomization, and was analyzed and reported completely and accurately. As with all study reports, there is also a presumption of innocence to the effect that errors will be at least limited somewhat by honest intent of the researchers; but, based on infamous examples of fraud and denial (Anonymous 2009; Baggerly and Gunsalus 2015; George and Buyse 2015), some may be unwilling to automatically assume such integrity.

Thus, real randomized trials are doubtful if not dangerous gold standards: As with observational studies and contrary to common mythology, a trial does not identify the target effect without rather strong assumptions beyond randomization of exposure. The same assumptions also underlie the application of traditional statistical methods to causal inference, and are controversial whether or not we adopt formal causal models for analysis.

*Objections based on narrowness of scope*

^{3}That view was criticized by Glymour (1986) and Glymour and Glymour (2014) on grounds similar to those given by recent epidemiologic critics, i.e., elements of causality not captured by interventionist accounts reflect a deficiency of that account, rather than a deficiency of common usage. To quote Clark Glymour (1986, p. 966):

People talk as they will, and if they talk in a way that does not fit some piece of philosophical analysis and seem to understand one another well enough when they do, then there is something going on that the analysis has not caught. That is not a failing of the speakers. It is, if anything, a failing of we who philosophize, *even if we philosophize with statistics*. [emphasis added]

Thus, while it is fair to warn against treating all causes as if they were interventions, it is a mistake to extend this caution into a claim that only interventions or treatments should be considered as causes, or that potential outcomes are logically limited to interventions.

Holland (1986) further asserted that we are especially limited to studying effects of causes (“forward causal inference”) rather than causes of effects (“reverse causal inference”). Unfortunately, this assertion has been repeated uncritically and turned into a criticism of potential-outcome models (Schooling et al. 2016), although to do so is incorrect (Pearl 2015). To see why, note first that in common usage, there is no difference between saying “A caused B” and “B was an effect of A”; paralleling that usage, in potential-outcome theory both phrases mean that there is an analysis unit whose value of B varies with the value of A, thus requiring a change in the potential B outcome as one moves across levels of A (and an arrow from A to B in a causal diagram).

The notion that causes of effects cannot be aided by potential-outcome models may have stemmed from the incorrect identification of the models with intervention analysis. To be sure, there is a distinct logical asymmetry between searching for effects of interventions and for effective interventions on an outcome. Effects can be found by randomization of the intervention; in theory, one trial could satisfy this search. In contrast, there is no limit on the number of trials we might need to find an effective intervention for a targeted outcome.

Going beyond interventions, a more general asymmetry arises in structural-equation models that compute effects from causes: In those models, considerable information about the causal inputs is lost when computing the output effects^{4}, reflecting again the intrinsically greater difficulty of identifying causes from effects (Dawid 2000; Gelman 2011). Nonetheless, these difficulties are no basis for the claim that potential-outcome models cannot aid reverse causal inference, as when they are used to model effects and biases in case-control analyses of multiple exposures.

*Narrow identification conditions are not intrinsic to potential outcomes*

Exchangeability, positivity, and consistency are *not* necessary conditions for causal inference (Broadbent et al. 2016). Specifically,

1. Exchangeability is not necessary: If we observe equal death rates among the exposed and nonexposed, and know that the nonexposed smoke more than the exposed, then (barring qualitative interactions) this nonexchangeability will only strengthen any inference that exposure causes death. In terms of potential outcomes, knowing the nonexposed smoke more leads us to infer a negative association between exposure and the *potential* death indicator under nonsmoking; this is a direct violation of exchangeability. Combining this inference with no association between exposure and the observed outcomes leads us to infer there is an exposure effect.^{5}
2. Positivity is not necessary: Suppose we observe that those given 10 and 20 mg of a drug suffer a side effect at 5 and 25 times the rate of those given a placebo and infer that this is a drug effect, but 40 mg was never administered. Then upon assuming monotonicity of the effect we would not only infer that 10 and 20 mg can cause the side effect but also that 40 mg can cause the side effect, even though positivity fails.^{6}
3. Consistency is not necessary: Suppose we observe that a continuous outcome is associated with a drug and infer that this is a drug effect, but the outcome is mismeasured classically relative to the potential outcome under the actual treatment received. Then consistency will be violated (because the observed outcome is not always equal to the potential outcome under the treatment received) but we would nonetheless infer that the drug affects the outcome.^{7} This point might be evaded by declaring the measured outcome to be a variable distinct from the realized potential outcome, but nonetheless shows how defining consistency as an assumption in causal modeling invites confusion of important measurement-error and variable-definition considerations by compressing them into a narrow potential-outcomes framework, at the cost of logical clarity and simplicity (Pearl 2010).^{8}

^{9}
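The exchangeability argument in the first item can be followed with simple arithmetic. The sketch below assumes an additive risk model with no exposure-smoking interaction, with smoking as the only source of nonexchangeability; all numeric inputs and the function name `implied_exposure_effect` are hypothetical and chosen only to mirror the verbal example.

```python
def implied_exposure_effect(
    risk_exposed,
    risk_unexposed,
    smoking_prev_exposed,
    smoking_prev_unexposed,
    smoking_risk_difference,
):
    """Exposure risk difference implied by an additive, no-interaction model.

    Assumed (hypothetical) model: risk = baseline + b_exposure*E + b_smoking*S.
    The crude risk difference then equals
        b_exposure + b_smoking * (P(S | E=1) - P(S | E=0)),
    which is solved here for b_exposure.
    """
    confounding_bias = smoking_risk_difference * (
        smoking_prev_exposed - smoking_prev_unexposed
    )
    crude_difference = risk_exposed - risk_unexposed
    return crude_difference - confounding_bias


# Invented inputs: equal observed death risks, but the nonexposed smoke more.
effect = implied_exposure_effect(
    risk_exposed=0.10,
    risk_unexposed=0.10,
    smoking_prev_exposed=0.2,
    smoking_prev_unexposed=0.6,
    smoking_risk_difference=0.05,
)
print(f"Implied exposure risk difference: {effect:.3f}")
```

Because the nonexposed carry more of the smoking-related risk, a null crude association implies a positive exposure effect under these assumptions, even though the groups are plainly not exchangeable.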

More generally, statistical analyses do not supply necessary or sufficient conditions for inference outside of their adopted formal frameworks; they only illustrate what *could* be inferred from the assumed data and model along with further assumptions (as seen in inferences about effects derived from data, a marginal-structural model, and the 3 identification conditions listed above). Thus causal modeling, like other statistical approaches, is in the end an exercise in hypothetical reasoning to aid our actual inference process, and should not be identified with the process itself (except perhaps in narrow artificial-intelligence applications). Instead, any inferential statistic (such as an effect estimate derived from a causal model) should be viewed as the result of only one scenario in a potential sensitivity analysis, and thus can be (and usually is) misleading without reference to results derived under other scenarios.
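To illustrate how a single estimate is only one scenario among many, the sketch below applies a standard external-adjustment formula for one unmeasured binary confounder across a small grid of assumed bias parameters. The observed risk ratio, the grid values, and the function name `bias_factor` are hypothetical; the formula also assumes that the confounder-outcome risk ratio is the same in the exposed and unexposed.

```python
from itertools import product


def bias_factor(confounder_rr, prev_exposed, prev_unexposed):
    """Confounding risk ratio from one unmeasured binary confounder.

    Assumes the confounder-outcome risk ratio (confounder_rr) does not vary
    with exposure; prev_* are confounder prevalences in each exposure group.
    """
    return (prev_exposed * (confounder_rr - 1.0) + 1.0) / (
        prev_unexposed * (confounder_rr - 1.0) + 1.0
    )


observed_rr = 1.8  # hypothetical crude estimate

# Each combination of bias parameters is one scenario in the analysis.
scenarios = []
for confounder_rr, p1, p0 in product([1.0, 2.0, 4.0], [0.3, 0.6], [0.1, 0.3]):
    adjusted_rr = observed_rr / bias_factor(confounder_rr, p1, p0)
    scenarios.append((confounder_rr, p1, p0, adjusted_rr))

print("conf_RR  prev_exp  prev_unexp  adjusted_RR")
for confounder_rr, p1, p0, adjusted_rr in scenarios:
    print(f"{confounder_rr:7.1f}  {p1:8.1f}  {p0:10.1f}  {adjusted_rr:11.2f}")
```

The unadjusted estimate corresponds to the rows with a confounder risk ratio of 1; the other rows show how far the estimate could move, including across the null, under alternative assumptions that the data alone cannot rule out.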

That said, potential-outcome models can express most of the causal concepts and bias concerns I have encountered in practice. Whether it is always helpful to express problems in such formal terms involves considering the trade-off between the benefit of precision and the attention to detail that bias modeling requires, along with the risk of overconfidence that the mathematics may generate from an overly narrow bias model (Poole and Greenland 1997).

*Other objections to criticisms*

Defenses of causal models have emphasized the improved relevance and precision that counterfactuals force into thinking about causes and effects – enough precision to deduce experimental tests of the resulting precise hypotheses (Daniel et al. 2016; Hernán 2016; Kaufman 2016; Robins and Weissman 2016; VanderWeele 2016a; VanderWeele et al. 2016). Formalization is further justified as providing an entry point for deductive reasoning and the mathematical power it brings, whose justification does not require a claim to encompass all types and needs of inference. From this view, to criticize detailed mathematical treatments (such as VanderWeele’s (2015) remarkable book) for not covering every topic needed for actual causal inferences seems like criticizing an aerodynamics textbook because it does not cover all the tasks needed to manufacture airplanes.

These objections to the restricted potential outcomes approach (RPOA) leave aside the complaint that “intervention” seems itself ambiguously defined (Vandenbroucke et al. 2016; Pearce and Vandenbroucke 2017; Broadbent et al. 2016). Responses emphasize that vagueness afflicts all practical uses of language and logic, including usage of “cause” and “effect,” and that potential-outcomes modeling helps pinpoint and reduce vagueness in studies of causation (Robins and Greenland 2000; Robins and Weissman 2016; Hernán 2016; VanderWeele 2016a; VanderWeele et al. 2016). These replies parallel Glymour’s (1986) message: In practical terms we operate well enough with conceptual ambiguity, allowing that a cause might be termed an intervention by some but not all discussants.

In agreement, some criticisms of causal modeling indicate that the vagueness of natural language in informal discussions is no reason to discard that discussion in favor of purely formal, mathematical analyses (Krieger and Davey Smith 2016; Vandenbroucke et al. 2016). More strongly put: Causal modeling cannot take over all the synthetic (‘inductive’) judgments required for actual causal inferences (as opposed to outputs of algorithms) (Greenland 2012a; VanderWeele 2016a; Broadbent et al. 2016) and may not always provide helpful insights (VanderWeele 2016c), but nonetheless modeling exercises can help sharpen informal reasoning.

In the latter role, the interpretations of terms in the model do not have to be settled for deductive aspects of modeling to proceed. Logic and mathematics operate with undefined (‘primitive’) concepts and abstract objects defined only from those concepts, which together form the logical structure (syntax) of the model.^{10} The interpretation (meaning or semantics) of the model and the deductions from it are left to the user. There are always a huge number of interpretations that would follow the logical structure; more precise definitions narrow down the vast space of logically acceptable interpretations to a far more practical range.

For example, probability theory is *the* accepted formal model for the concepts of association and independence. Nonetheless, the causal-modeling movement has shown persuasively that more mathematical structure is needed for causal inference (e.g., causal models capture the key distinction between noncollapsibility and confounding, concepts which are traditionally confused in statistics under the heading of “Simpson’s paradox” (Greenland et al. 1999; Pearl 2009; Hernán et al. 2011)). But as valuable as causal modeling is, it remains insufficient for valid scientific inference because *no* formal method is sufficient (Broadbent et al. 2016).
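The distinction between noncollapsibility and confounding can be shown numerically. The sketch below uses invented risks for a binary covariate that is equally distributed in the exposed and unexposed (so it cannot confound the marginal comparison); the stratum shares, risks, and helper names are hypothetical.

```python
def odds(probability):
    """Convert a risk to an odds."""
    return probability / (1.0 - probability)


def odds_ratio(risk_exposed, risk_unexposed):
    return odds(risk_exposed) / odds(risk_unexposed)


# Hypothetical strata of a binary covariate; "share" is the same in both
# exposure groups, so the covariate is unassociated with exposure.
strata = [
    {"share": 0.5, "risk_unexposed": 0.2, "risk_exposed": 0.5},
    {"share": 0.5, "risk_unexposed": 0.5, "risk_exposed": 0.8},
]

stratum_odds_ratios = [
    odds_ratio(s["risk_exposed"], s["risk_unexposed"]) for s in strata
]
stratum_risk_differences = [
    s["risk_exposed"] - s["risk_unexposed"] for s in strata
]

# Marginal risks are share-weighted averages of the stratum risks.
marginal_risk_exposed = sum(s["share"] * s["risk_exposed"] for s in strata)
marginal_risk_unexposed = sum(s["share"] * s["risk_unexposed"] for s in strata)

marginal_odds_ratio = odds_ratio(marginal_risk_exposed, marginal_risk_unexposed)
marginal_risk_difference = marginal_risk_exposed - marginal_risk_unexposed

print("Stratum odds ratios:     ", [round(x, 2) for x in stratum_odds_ratios])
print("Marginal odds ratio:     ", round(marginal_odds_ratio, 2))
print("Stratum risk differences:", [round(x, 2) for x in stratum_risk_differences])
print("Marginal risk difference:", round(marginal_risk_difference, 2))
```

The odds ratio is the same in both strata yet differs from the marginal odds ratio, while the risk difference is unchanged by collapsing over the covariate; since exposure and covariate are unassociated by construction, the odds-ratio discrepancy here reflects noncollapsibility rather than confounding.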

*Feasibility and precision: Not necessary, but desirable*

Counterfactual reasoning (whether or not in the form of potential-outcome modeling) can be applied to any putative causal variable one desires, leaving it to the reader to judge whether the resulting effect estimate is reliable for guiding policy. Nonetheless, aiming our causal analyses toward feasible interventions aligns them with public-health goals and consequentialism (Greenland 2005; Hernán 2005; Galea 2013; Glass et al. 2013; Kaufman 2016; Keyes and Galea 2015; Chiolero 2016). Yes, noninterventions like earthquakes are clearly causes of adverse outcomes; furthermore, we can reason quite usefully about counterfactuals such as “had the earthquake not occurred the house would not have collapsed,” thus demonstrating that limiting counterfactual causal reasoning to feasible interventions is counterproductive. But current policy actions based on that causal knowledge will involve alteration of building codes and their enforcement, not earthquake manipulation. Thus what at first might seem an overly narrow approach to causality serves public health well.

Counterfactual definitions arise in many philosophical accounts of causality (Hume 1748; Mill 1843; MacMahon and Pugh 1967; Stalnaker 1968; Lewis 1973). We can talk informally of counterfactuals for factors that lack universally accepted precise definitions, such as low SES, nonwhite race, or war (e.g., “had the second world war been averted, the population of Russia would have been much higher today”), but the vagueness of their counterfactual conditions makes it difficult to see how they can be meaningfully treated as causes in potential-outcome models unless redefined with a precision absent from ordinary usage (Greenland 2005; Hernán 2005; Hernán and Taubman 2008; VanderWeele and Robinson 2014; Kaufman 2014; VanderWeele 2016a). In contrast, for tasks such as choosing among precise treatment regimes, potential outcomes appear to be natural if not indispensable (Hernán and Robins 2017).

In sum, like others (e.g., VanderWeele and Hernán 2012; VanderWeele et al. 2016; VanderWeele 2016a; Naimi 2016) I have no problem with common usage in which SES, race, and the like are labeled “causes”, even though one may question the practical value of treating such vague, complex entities as if they index potential outcomes in a causal model. Instead, practical concerns should prod us to extend our analysis to potentially modifiable components of the initial study factors, for which potential outcomes can be made precise.

*Realism and precision*

Interpretations that treat potential-outcome models as representations of reality (as opposed to mere aids for reasoning) invoke a *definiteness* assumption in which the outcomes corresponding to counterfactual treatments exist in some objective sense. Precise exposure definitions not only align causal modeling more closely with consequentialism; they also make definiteness more plausible. Definiteness is automatically satisfied by mechanistic models even if their output quantities are merely distributions (VanderWeele and Robins 2012),^{11} for to have a mechanism is to have a formula that provides what the outcome would be upon entering input causes into the formula. Nonetheless, definiteness is difficult to conceive let alone evaluate when no mechanism or intervention is offered for producing counterfactual exposure states, or when those states are ambiguous.

Consider classical deterministic physics: We can view the famed Newtonian law F = ma as providing a structural equation a_{F} = F/m for the potential accelerations a_{F} that would be seen for an unrestrained body of mass m subjected to a possibly counterfactual force F that causes acceleration. Such examples show how the engineering power of physical sciences derives from translating descriptive (associational) laws into causal laws that provide accurate potential outcomes in precisely delineated settings (Pearl 2009). Statistical versions begin by allowing random disturbances in the equation, for example defining stochastic potential outcomes for acceleration via ln(a_{F}) = ln(F) − ln(m) + ε (which can be seen as a regression equation with known coefficients 1 and −1 and a random intercept ε representing unmeasured forces).
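As a minimal sketch of the stochastic structural equation just described, the following Python fragment generates potential accelerations a_{F} for a single body under several possibly counterfactual forces. The function name, the disturbance scale `sigma`, the numerical values, and the choice to hold one disturbance ε fixed across all counterfactual forces for the same unit are illustrative assumptions, not part of the original discussion.

```python
import math
import random

def potential_acceleration(force, mass, eps):
    """Potential outcome a_F from ln(a_F) = ln(F) - ln(m) + eps."""
    return math.exp(math.log(force) - math.log(mass) + eps)

rng = random.Random(0)   # fixed seed so the sketch is reproducible
mass = 2.0               # hypothetical mass
sigma = 0.05             # assumed spread of the disturbance eps

# Assumption: one draw of eps per unit, shared by every counterfactual
# force, so each potential outcome for this unit has a definite value.
eps = rng.gauss(0.0, sigma)
forces = [1.0, 2.0, 4.0]  # hypothetical forces
outcomes = {F: potential_acceleration(F, mass, eps) for F in forces}

# With eps = 0 the equation reduces to the deterministic law a_F = F/m.
deterministic = {F: potential_acceleration(F, mass, 0.0) for F in forces}
```

Because the disturbance enters additively on the log scale, doubling the force doubles the potential acceleration for the unit whatever value ε takes, which is one way of seeing what definiteness buys in a mechanistic model.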

Causal interpretations of physical laws complete the physics-engineering paradigm, which has inspired aggressive mathematical modeling in statistics and causality theory. But there is no such transparent physical basis for structural equations and potential outcomes in soft sciences. That lack of basis, coupled with ambiguity of factors and the resulting indefiniteness, leaves structural equations and potential outcomes without precise meaning or clear utility (King and Zeng 2007).

Another objection to extension of potential-outcome models beyond manipulable variables is that the label ‘causal model’ might itself warp judgments toward inferring causality, even when causal modeling adds no reliable information beyond regression analyses (King and Zeng 2007). Relatedly, there has been considerable debate about the practical value of focusing mediation analysis on ‘natural’ (pure) direct and indirect effects as opposed to controlled effects (Naimi et al. 2014; VanderWeele 2015). Because natural effects are typically not identified by feasible randomized experiments, the debate might be seen as a variation on the above controversies, but it also raises deeper issues of model realism and the distraction that neat mathematical results can create from practical pursuits (Naimi et al. 2014).^{12}

## Resistance to causal diagrams

Even in their weakest form, diagrams can be used to help specify and evaluate more traditional statistical models and methods. Given their compatibility with traditional methods, along with their generality and relative transparency, one might have expected causal diagrams to be welcomed by the research community. Alas, this has not always been the case, as witnessed by critiques combining vague discouragements with basic misunderstandings of graphical models.

In particular, not all criticisms of potential outcomes or counterfactuals carry over to graphical models. Contrary to some critics, causal diagrams do *not* require counterfactuals or any particular form or interpretation for the included variables (Spirtes et al. 2001; Greenland and Brumback 2002; Robins and Richardson 2011). Impressions otherwise might have arisen from treatments in which causal diagrams are directed graphical displays of nonparametric structural-equation models (Pearl 1995, 2009). In these models, every variable in the diagram is a deterministic function of its direct causes (graphical parents) and an independent random effect. This is indeed a counterfactual formalization of causal diagrams, although not the only one (Robins and Richardson 2011; Richardson and Robins 2013).

1. Causal completeness: effects between the graphed variables correspond to directed paths from causes to effects, and

2. Observational compatibility: all associations among the graphed variables are transmitted only by open paths inside the diagram (i.e., there is no association between variables that have no open path between them).

Analogous assumptions can be found in other formal causal-inference methodologies (usually as an assumption of no uncontrolled bias or the like), and thus can be viewed as encoding a minimal set of assumptions for qualitative causal inference. The assumptions add enough structure beyond the DAG to capture the difference between bias from failure to condition on shared causes (“classical confounding” or directly causal confounding) and bias from conditioning on shared effects (collider bias) – a distinction which seems fiendishly difficult to convey using only probability and potential outcomes.^{13}
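The distinction between the two biases can be illustrated with a small simulation. The sketch below is only that: the linear Gaussian mechanisms, the sample size, the stratum width, and the selection threshold are all hypothetical choices made for illustration. In the first diagram X and Y share a cause C and have no effect on each other; in the second, X and Y are independent but share an effect S.

```python
import random

def corr(xs, ys):
    """Pearson correlation, written out to keep the sketch dependency-free."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(1)
n = 20000

# Shared cause: X <- C -> Y, with no arrow between X and Y.
c = [rng.gauss(0, 1) for _ in range(n)]
x1 = [ci + rng.gauss(0, 1) for ci in c]
y1 = [ci + rng.gauss(0, 1) for ci in c]
confounded = corr(x1, y1)          # path through C is open

# Conditioning on C (crudely, by restricting to a narrow stratum of C)
# closes the path and removes most of the association.
stratum = [(a, b) for a, b, ci in zip(x1, y1, c) if abs(ci) < 0.1]
adjusted = corr([a for a, _ in stratum], [b for _, b in stratum])

# Shared effect: X -> S <- Y, with X and Y generated independently.
x2 = [rng.gauss(0, 1) for _ in range(n)]
y2 = [rng.gauss(0, 1) for _ in range(n)]
s = [a + b + rng.gauss(0, 1) for a, b in zip(x2, y2)]
marginal = corr(x2, y2)            # path through S is closed

# Conditioning on S (here, selecting units with large S) opens the path
# and induces an association that no effect of X on Y produced.
selected = [(a, b) for a, b, si in zip(x2, y2, s) if si > 1.0]
collider = corr([a for a, _ in selected], [b for _, b in selected])
```

In the shared-cause case conditioning removes an association, while in the shared-effect case conditioning creates one; the diagram makes the difference visible at a glance, whereas the two data sets look alike until the conditioning is done.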

The causal model implied by the above minimal assumption pair involves as many strict assumptions as there are absent arrows. The strength of the two assumptions increases in proportion to the number of arrows and variables omitted from the diagram, with every omission an assumption of one or more causal null hypotheses (Greenland 2010a). Thus the more sparse and simple the diagram, the stronger the model it represents. There are also assumptions and phenomena that current causal diagrams leave implicit or do not capture (Dawid 2008; Flanders et al. 2011; Maldonado 2013; Greenland and Mansournia 2015a; Aalen et al. 2016), such as random confounding, or capture only clumsily, like interactions.

As with other causal models, inferences from causal diagrams come with validity guarantees only when the diagram represents an experiment (perhaps natural) in which the input variables (root nodes) were independently randomized, *and* subsequent selection into the analyzed data set (not just the study) was also independently randomized – as in a perfect if complex trial with perfect subsampling. Provided a selection variable S appears as a conditioned node in the diagram, this feature makes a diagram excellent for illustrating what may go wrong with various study designs and adjustment procedures. But, as with other devices, any doubts about the diagram’s many null assumptions make it much less reliable for telling us what we can safely infer about the targeted reality. Especially, without actual randomization of a variable, the complete absence of an effect on it from a potential direct cause is usually unsupportable with data, and rarely credible for complex possible direct causes with many avenues for effects (e.g., age, race, and sex).

## Resistance to basic statistical reform

No one should doubt that critical scrutiny of established as well as new methodology is warranted, for there are dramatic examples in which standard statistical technologies caused widespread damage to science. In particular, the quantitative methodology movement saw how much of research statistics was mired in faulty traditions derived from misreadings of innovators who could not foresee the harms their innovations would produce.

The prime example is of course null-hypothesis significance-testing, an astonishing ‘worst of both’ hybridization of Fisherian significance testing with Neyman-Pearson alpha-level hypothesis testing (Goodman 1993) that has dominated statistics yet addresses only random errors – which are often of least concern in nonexperimental research. Null testing serves as the poster child for the disasters that can result from casual interpretations, as in the dumbed down – and often thoroughly wrong – definitions of P-values that pervade elementary guides and textbooks. For example, saying the P-value measures “the probability that the difference is due to sampling error” or “chance” is logically no different than claiming the P-value is the probability of the null hypothesis, which it is emphatically not (Wasserstein and Lazar 2016; Greenland et al. 2016). Among other examples are the misuses of correlation coefficients as effect measures, and of standard deviations as “standard” units for covariate measurements – misuses which are still the norm despite over 60 years of criticism (Tukey 1954; Greenland et al. 1991).

Early on, calls for shifts away from null testing in medicine and public health (Rothman 1978, 1986; Walker 1986a, b) elicited defensive responses from some biostatisticians (Fleiss 1986a, b; Rouen 1987; Lachenbruch et al. 1987). Rebuttals to those defenses called for methods that analyzed multiple values for effect sizes, not just the null, such as P-value functions and likelihood graphs, which had long existed in the statistics literature but had been ignored in practice (Poole 1987; Goodman and Royall 1988). Although these calls were also largely ignored, subsequent decades saw increasing rejection (pun intended) of null testing, usually in favor of confidence intervals – as long recommended by more perceptive statisticians (Yates 1951; Cox 1958; Altman et al. 2000). This shift away from obsessions with null testing paralleled movements in other fields, especially psychology and social sciences (e.g., Gigerenzer 1998, 2004).

*Biased statistical interpretation remains pervasive*

Unfortunately, interval estimation did not stop core misinterpretations of statistics. It is thus fair to ask how much this failure was due to teaching deficiencies and how much was due to deeper cognitive problems embedded in established doctrines. The problem most apparent is a general bias toward null conclusions fed by beliefs that this bias, rather than impartiality, is a hallmark of “the scientific method” (Greenland 2011, 2012c). As an example (sadly not atypical), Seliger et al. (2016) reported a case-control study of statins and risk of glioma with conclusions such as “use of statins was not associated with risk of glioma”, “this matched case-control study revealed a null association between statin use and risk of glioma”, and “our findings do not support previous sparse evidence of a possible inverse association between statin use and glioma risk”. These statements cite an estimated odds ratio of 0.75 with 95% CI of 0.48-1.17 given in their Abstract, which in their Discussion is compared to the 0.72 (0.52-1.00) and 0.76 (0.59-0.98) reported in the previous studies. Thus remarkably consistent findings of inverse association were reported as conflicting, all because only the widest of the 3 intervals – the authors’ – included the null. Anyone who fails to see how the quoted statements are completely wrong should immediately and carefully read Greenland et al. (2016) which appeared a few issues previously in this, the same journal, and the ensuing letter regarding Seliger (Greenland 2017a).
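The arithmetic behind this example can be checked directly from the three reported intervals. The sketch below assumes each interval is a Wald-type 95% interval symmetric on the log odds-ratio scale, so that the standard error can be back-calculated from the limits, and it treats the three estimates as independent; those assumptions, the function names, and the labels given to the two earlier studies are mine and not taken from the cited reports.

```python
import math
from statistics import NormalDist

Z975 = NormalDist().inv_cdf(0.975)   # about 1.96

def log_or_and_se(or_hat, lower, upper):
    """Back-calculate ln(OR) and its standard error from a 95% interval."""
    se = (math.log(upper) - math.log(lower)) / (2 * Z975)
    return math.log(or_hat), se

def p_value(estimate, se, hypothesized=0.0):
    """Two-sided P-value for a hypothesized value of ln(OR)."""
    z = (estimate - hypothesized) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

studies = {
    "Seliger et al. 2016": (0.75, 0.48, 1.17),
    "earlier study A": (0.72, 0.52, 1.00),   # label is a placeholder
    "earlier study B": (0.76, 0.59, 0.98),   # label is a placeholder
}
fitted = {name: log_or_and_se(*vals) for name, vals in studies.items()}

# P-value for the null (OR = 1) in each study.
p_null = {name: p_value(b, se) for name, (b, se) in fitted.items()}

# P-value for "no difference" between Seliger et al. and each earlier study.
b0, se0 = fitted["Seliger et al. 2016"]
p_diff = {
    name: p_value(b0 - b, math.sqrt(se0 ** 2 + se ** 2))
    for name, (b, se) in fitted.items()
    if name != "Seliger et al. 2016"
}
```

Under these assumptions the P-value for the null in the Seliger et al. data is roughly 0.2, while the P-values comparing their estimate with each earlier estimate exceed 0.8. The data thus offer far less evidence against the earlier findings than against the null, which is the opposite of what the quoted conclusions suggest.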

The medical literature abounds in parallel absurdities, such as writing as if there is a meaningful difference between P=0.04 and P=0.06 when there is not (Tukey 1962, p. 62; Gelman and Stern 2006). And adherence to a special status for null hypotheses remains dominant in medical sciences, despite it reflecting a generally false assumption that, for all stakeholders, incorrect null rejections (false positives) are more costly than incorrect acceptances (false negatives) (Neyman 1977). This value-laden assumption often reflects the innate conflicts of interest in postmarketing medical research, where nearly all physicians’ practices are involved as database contributors, and where denial of side effects is incentivized by the potential economic loss and liability that those effects generate (Greenland 2016). Such conflicts make it unsurprising that fallacious, null-biased interpretations of P-values and confidence intervals remain prevalent.

As has been widely discussed under the heading of the “replication crisis”, one explanation for null bias is fear of the widespread bias *away* from the null produced by searching for “significant” results to report (fishing, significance questing, P-hacking) or by data-driven model selection (Leamer 1978; Ioannidis 2008; Gelman and Loken 2014a). The incentives for “significance questing” have been discussed extensively, including the need to find results ostensibly worthy of publication (Rhodes 2016). But almost no attention is given to the mirror problem in which analysts search for or engineer *non*significance, as when they would prefer to infer safety from studies of adverse effects and so employ a power-destroying Bonferroni correction to adjust away any troublesomely small P-values. Thus there is null bias even in the literature on problems created by null testing!

In light of the limited reforms after decades of concerted complaints about statistical bias and misinterpretation (both toward and away from the null), we must consider psychosocial obstacles to change. For fields like health and social sciences, career and social pressures for unambiguous conclusions come up against the painfully slow rate of evidence accumulation (Belluz et al. 2016; Farsides and Sparks 2016). Those pressures, along with the need to make one’s study appear as important as possible, fuel a preference for overconfident misinterpretations and illusory decisiveness (Gelman and Loken 2014b). Such perverse incentives have typically been countered by bias (shrinkage) toward the null, as crudely implemented by null testing, and more defensibly as null penalization (Greenland and Mansournia 2015b). This bias has evolved into endemic misinterpretations that carry the authority of textbooks and tradition – as attested by the way fallacious inferences continue to flow from otherwise reasonable researchers who treat their interval estimates as tests of the null, and thus fail to understand their own confidence intervals as expressions of uncertainty (Poole 2001).

*Different does not mean better: the case of Bayesian testing*

One attempt to address the deficiencies of conventional statistics can be found in the resurrection of Bayesian methodology in the late 20th century, which encountered considerable resistance of its own. Unfortunately, its testing methodology absorbed the same nullistic fallacies of significance testing into superficially more sophisticated (and thus far harder to correct) methods. Those methods proceed by placing spikes of prior probability at the null, leading to claims that P-values “exaggerate evidence” against the null hypothesis (e.g., Sellke et al. 2001; Goodman 2005; Wagenmakers 2007). A catch is that, *by definition*, in soft sciences there is no accepted law of nature from which we can deduce that some causal parameter is null; there is thus no basis for a null spike.

By loading the null with a quantity representing nonexistent absolute evidence for a fictional law, one is creating inferential bias toward the null (Greenland 2009a, 2011, 2016). Thus we should not be surprised if, according to these methods, *P*-values spuriously appear to be evidentially biased away from the null. Fortunately, not all those who appreciate both frequentist and Bayesian methods have fallen into this trap, recognizing that a *P*-value does supply information (limited though that is) about the compatibility between the model (set of assumptions) and the data used to compute it^{14} (Senn 2001, 2002; Greenland 2009a; Gelman 2013; Greenland and Poole 2013; Greenland et al. 2016).
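How strongly the spike drives the result can be seen in a simple normal-normal sketch. This is one textbook setup chosen for illustration, not the specific procedure of any paper cited above: the estimate is assumed normal with known standard error s, the alternative places a normal prior with mean zero and variance τ² on the parameter, and `ratio` denotes τ²/s². The function names and numerical inputs are hypothetical.

```python
import math

def bf01(z, ratio):
    """Bayes factor for the point null versus a N(0, tau^2) alternative.

    z is the estimate divided by its standard error s; ratio is tau^2 / s^2.
    """
    return math.sqrt(1 + ratio) * math.exp(-0.5 * z * z * ratio / (1 + ratio))

def posterior_null(z, spike, ratio):
    """Posterior probability of the null given prior mass `spike` on it."""
    b = bf01(z, ratio)
    return spike * b / (spike * b + (1 - spike))

z = 1.96        # an estimate with two-sided P of about 0.05
ratio = 1.0     # assumed: prior variance equal to the sampling variance

# Same data, same alternative, different spikes of prior mass at the null.
by_spike = {spike: posterior_null(z, spike, ratio)
            for spike in (0.5, 0.25, 0.05)}
```

With half the prior mass on the null, the posterior probability of the null stays above 0.3 even though P is about 0.05; with a spike of 0.05 it falls below 0.05. The apparent “exaggeration” by the P-value is thus a reflection of the prior mass loaded onto the null rather than a property of the data.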

## The problem of unjustified assumptions: sensitivity and bias analysis

An ongoing concern is that excessive focus on formal modeling and statistics can lead to neglect of practical issues and to overconfidence in formal results (Box 1990; Greenland 2012a; Gelman and Loken 2014b). Analysis interpretation depends on contextual judgments about how reality is to be mapped onto the model, and how the formal analysis results are to be mapped back into reality (Tukey 1962; Box 1980). But overconfidence in formal outputs is only to be expected when much labor has gone into deductive reasoning. First, there is a need to feel the labor was justified, and one way to do so is to believe the formal deduction produced important conclusions. Second, there seems to be a pervasive human aversion to uncertainty, and one way to reduce feelings of uncertainty is to invest faith in deduction as a sufficient guide to truth. Unfortunately, such faith is as logically unjustified as any religious creed, since a deduction produces certainty about the real world only when its assumptions about the real world are certain (which is to say, accepted as laws of nature – something I hope none of us would say of any of our analysis models).

Unfortunately, assumption uncertainty reduces the status of deductions and statistical computations to exercises in hypothetical reasoning – they provide best-case scenarios of what we could infer from specific data (which are assumed to have only specific, known problems). Even more unfortunate, however, is that this exercise is deceptive to the extent it ignores or misrepresents available information, and makes hidden assumptions that are unsupported by data. An example of the latter is the assumption of reporting honesty. As cited earlier, violations have been well documented, and claims that such events must be exceptionally rare do not seem to be based on factual investigations of discovery rates. Thus a major hidden assumption is called into question, and weakening it (say by allowing a bias parameter for misreporting) could weaken inferences to an unknown and possibly large degree, depending on the context.

Despite assumption uncertainties, modelers often express only the uncertainties derived *within* their modeling assumptions, sometimes with disastrous consequences. Econometrics supplies dramatic cautionary examples in which complex modeling has failed miserably in important applications (Taleb 2010; Romer 2016). A case can be made that epidemiologic statistics has suffered similarly in the repeated failure of randomized trials to bear out apparent protective effects of micronutrients such as beta-carotene. These problems are not seriously addressed by using so-called “robust” or “semi-parametric” methods, which deal only with violations of explicit parametric assumptions rather than the validity assumptions of greatest concern to researchers.

Unless enforced by study design and execution, statistical assumptions usually have no external justification; they may even be quite implausible. As a result, one often sees attempts to justify specific assumptions with statistical tests, in the belief that a high *P*-value or “nonsignificance” licenses the assumption and a low *P*-value refutes it. Such focused testing is based on a meta-assumption that every other assumption used to derive the *P*-value is correct, which is a poor judgment when some of those other assumptions are uncertain. In that case (and in general) one should recognize that the *P*-value is simultaneously testing all the assumptions used to compute it – in particular, a null *P*-value actually tests the entire model, not just the stated hypothesis or assumption it is presumed to test (Greenland et al. 2016).

Another problem with assumption testing, known for generations but widely ignored, is that pretesting – model or variable selection based on significance testing (as in stepwise regression) – can produce enormous bias and distortions in statistics based on the final model (Bancroft and Han 1977; Leamer 1978). These distortions include unnecessary confounding and spuriously small P-values and standard errors (Greenland and Neutra 1980; Freedman et al. 1988), yet these problems are rarely mentioned in study reports that employ pretesting. Worse, pretesting has been defended by statisticians on the grounds of being reproducible (e.g., Fleiss 1986b) – never mind that it is reproducing distortions like deceptively narrow confidence intervals.

A more sensible approach to the uncertain-assumption problem is to expand modeling in the form of sensitivity and bias analysis (Leamer 1985; Phillips 2003; Greenland and Lash 2008; Lash et al. 2009). These analyses do indeed allow us to weaken our assumptions, and thus can provide more broadly valid inferences. Unfortunately, sensitivity analyses are themselves sensitive to the very strong assumptions that the models and parameter ranges used are sufficient to capture all plausible possibilities and important uncertainties, yet do not overemphasize dubious possibilities or include absurd scenarios (Greenland 1998). Such a meta-assumption is never assured in practice, and failure to recognize this fact can lead to inferences whose overconfidence is only boosted by the sensitivity analysis (Poole and Greenland 1997). Thus, while sensitivity and bias analyses are a step forward from traditional modeling, like all methodologies they are no panacea and should be approached with many cautions.
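A very simple instance of such an analysis is external adjustment of an observed risk ratio for a single unmeasured binary confounder. The sketch below uses the familiar confounding-factor formula, which assumes the confounder-outcome risk ratio is the same in exposed and unexposed (no effect modification) and ignores random error; the observed risk ratio and the grid of bias parameters are hypothetical, and the choice of grid is exactly the kind of meta-assumption cautioned against above.

```python
def confounding_factor(p_exposed, p_unexposed, rr_confounder):
    """Ratio by which an unmeasured binary confounder distorts the risk ratio.

    p_exposed, p_unexposed: confounder prevalence among exposed and unexposed.
    rr_confounder: confounder-outcome risk ratio, assumed constant across
    exposure levels.
    """
    numerator = p_exposed * (rr_confounder - 1) + 1
    denominator = p_unexposed * (rr_confounder - 1) + 1
    return numerator / denominator

def bias_adjusted_rr(rr_observed, p_exposed, p_unexposed, rr_confounder):
    """Observed risk ratio divided by the assumed confounding factor."""
    return rr_observed / confounding_factor(p_exposed, p_unexposed,
                                            rr_confounder)

rr_observed = 1.5   # hypothetical observed risk ratio

# Point-by-point variation of the bias parameters over a hypothetical grid.
grid = [(p1, 0.2, rr) for p1 in (0.2, 0.4, 0.6) for rr in (1.5, 2.0, 4.0)]
adjusted = {params: bias_adjusted_rr(rr_observed, *params) for params in grid}
```

For example, with confounder prevalences of 0.6 and 0.2 and a confounder-outcome risk ratio of 2, the factor is 1.6/1.2 and the adjusted risk ratio is 1.125. A table of such results shows only how the estimate moves over the scenarios one thought to include, and says nothing about scenarios left off the grid.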

Some limitations of sensitivity and bias analysis can be addressed by replacing point-by-point parameter variations with random coefficients, penalty functions, or prior distributions based on genuine contextual information (Gustafson and Greenland 2006; Greenland and Lash 2008; Greenland 2009b; Lash et al. 2009; Gustafson 2010; Gustafson and McCandless 2014). Unfortunately, the sophistication demanded for proper use and review of these methods may not commend them to everyday practice or basic instruction: The many unfamiliar choices they require make them highly vulnerable to abuse and misunderstanding, and thus a potentially disastrous requirement for individual study reports (Greenland 2017b). Nonetheless, the potential for greater realism in these approaches suggests they should be part of the skill set for everyone who will direct or carry out statistical analyses of nonexperimental data, to help them gauge the distance between the model they are using and more realistic models, and to help them critically evaluate the sensitivity and bias analyses they encounter.

## Implications for training and practice

Whether informal guidelines or formal modeling technologies, all inferential methods are practical aids with strengths and limitations, not oracles of truth. For simple exposures such as one-time treatments, explicit causal models are not essential for valid reasoning about causation, and when the causal variables are vague one can deploy informal analyses to good effect – provided one does so with sound judgment, as developed and elaborated by our illustrious predecessors (e.g., Hill 1965; Susser 1977, 1991). Why then emphasize causal models?

As has been often argued, causal models and diagrams are incredibly useful for avoiding bad judgments, causal fallacies, and harmful adjustments, and for identifying sources of bias and valid adjustments (Pearl 1995, 2009; Greenland et al. 1999; Höfler 2005; Phillips and Goodman 2004, 2006; Glymour and Greenland 2008). They are also indispensable for creating and understanding valid adjustments, especially when studying longitudinal treatment regimes (Hernán and Robins 2017). Thus, causal models are at least as important aids to inference as the *P*-values and confidence intervals that pervade our research, and I would include them in the basic education of all epidemiologists (as I did during my tenure at UCLA), as well as that of statisticians training to work in soft sciences.

Such technical requirements need to be thought out with unusual care, taking lessons from the horrific history of teaching and use of basic statistics (as attested by website and *textbook* misdefinitions of P-values). In particular, nullistic fallacies should be ferreted out and replaced by balanced discussion of competing hypotheses. Much time should be spent explaining the full details of what statistical models and algorithms actually assume, emphasizing the extremely hypothetical nature of their outputs relative to a complete (and thus nonidentified) causal model for the data-generating mechanisms. Teaching should especially emphasize how formal “causal inferences” are being driven by the assumptions of randomized (“ignorable”) system inputs and random observational selection that justify the “causal” label.

This focus on what are usually implicit assumptions would be a sharp departure from classical statistics training. In that tradition, inferences are supposedly driven by observed event frequencies (the data) – including purely hypothetical frequencies that are created by purely hypothetical randomization mechanisms. Being purely hypothetical, the frequencies against which models are calibrated have little genuine accuracy for discriminating among a broad range of causal models. This lack of relevant accuracy is reflected by the fact that the size of a P-value (whether large or small) may reflect some failure of the data model to remove a bias more than it reflects the presence or absence of the targeted effect.

A key point often mentioned by wise authorities (e.g., Tukey 1962; Susser 1977; Box 1990) – and yet often absent from statistics textbooks – is that formal inference technologies must leave out many important contextual details, and thus cannot replace sound judgment in application. The very informality of such judgment makes case study essential to methodologic education; overemphasis on formal models and algorithms at the expense of genuine contextual narratives may lead to disasters of mechanized inference akin to significance testing. Conversely however, such narratives tend to overlook or misunderstand counterintuitive phenomena, and may be too easily swayed by personal biases or generate overconfidence thanks to their mechanistic elements. Thus, formal and informal reasoning serve as checks on each other, so that resistance to and promotion of formal innovations need to be moderated away from any sense of exclusivity.

It will be especially essential to recognize when causal models have no basis beyond passively observed associations and thus encode no genuine information beyond their associational implications (something that modelers may easily overlook (Freedman 1987; King and Zeng 2007)). Passively observed frequencies supply only associational information, and (again) are compatible with many causal models. Thus, without external information to aid discrimination among causal models (such as internal randomization or external considerations (Hill 1965; Susser 1991)), we still need to deploy associational models for data reduction. In other words, mining and encoding of data information will continue to require acausal statistical modeling.

The latter need is met by current statistical training, however, which remains focused on associational inference and purely predictive modeling (such as ordinary regression). Pearl (2009) has challenged that focus, arguing that causation, not association, is apprehended most easily by our intuitions, and thus should be taught before statistical modeling. To do so, one first shows how associations are determined by causal pathways – for both the study effect, and for causation of bias via confounding, selection, or mismeasurement effects (Maclure and Schneeweiss 2001). Only then are statistical methods introduced to assess the associations via adjustments allowing for possible random (acausal) errors, or ‘noise’. This teaching order may sound radical in a world in which causal models are rarely taught in basic statistics, but better reflects the facts that causal inference is the ultimate goal of most soft-science research, and that noise reduction is a much more nonintuitive task than is basic causal inference from ideal noise-free experiments.

*Statistical training remains deficient in concept and scope*

I believe other educational omissions besides causal models have been major contributors to the currently lamented research “crises”. Two topics in dire need of early and continuing education are basic logic and cognitive psychology (Gilovich et al. 2002; Lash 2007). Especially important are the logical and statistical fallacies manifested in routine misinterpretations of basic statistics, and the biases built into current teaching and practice that encourage these fallacies (Box 1990; Greenland 2011, 2012b, c, 2016; Greenland et al. 2016). Their persistence attests to the fact that degrees in statistics and medicine do not require substantial training in or understanding of scientific research or reasoning, but nonetheless serve as credentials licensing expressions of scientific certainty (Greenland 2011, 2012c).

Despite laments about insufficient recognition (Breslow 2003), I think few doubt that statistics has made enormous contributions to the health sciences and is essential for sound research. While I am unaware of reliable data showing which educational approaches lead to good practice among researchers, decades of observational evidence have shown rather clearly that massive gaps and errors in statistical education and practice have damaged the research literature (Poole 2001; Ioannidis 2008). Worse, there are perverse incentives to defend past lapses and maintain the status quo (Belluz et al. 2016); indeed, the glacial pace of statistics reform should serve as a warning against excessive conservatism in teaching as well as practice.

These problems raise the question of whether basic statistics training should include approaches besides the currently dominant frequentist methodologies (significance tests, P-values, and confidence intervals). In the pluralistic spirit outlined for causal inference, the answer is yes: several statistical methodologies can be employed in the same problem to check on each other and to build analyses that have sound interpretations from multiple perspectives. In particular, compatible frequentist and Bayesian ideas can be integrated into a dynamic inferential process in which prior distributions or penalty functions are used to encode imprecise contextual information into fitted models, frequentist methods are used to test (calibrate) those models against data, and Bayesian methods are used to generate model-conditional inferences (e.g., see Box 1980; Dawid 1982; Rubin 1991; Little 2006; Gustafson and Greenland 2009; Greenland 2010b; Gelman and Shalizi 2013).

Nonetheless, I depart from the mainstream Bayesian revival in regarding simulation methods as insufficient for Bayesian education and sensible application (Greenland 2008; Sullivan and Greenland 2013, 2014; Discacciati et al. 2015), and in fearing that confirmationist (“objective”) Bayes testing methodology (Sellke et al. 2001; Goodman 2005; Wagenmakers 2007) is potentially even more biased for practice than frequentist significance testing (Greenland 2009a, 2011, 2016).

## Conclusions and Summary

Controversies over the role of formal, algorithmic methods are nothing new, surprising, or unhealthy. They seem to be aggravated however by treating methodologies as overarching philosophies rather than as technologies or tools. Debates have been muddied by the lack of clear boundaries separating uncritical promotion from sound justification, or obstructionism from cogent criticism. Promoters sometimes fail to recognize or adequately emphasize the limitations of their proposals, while critics can be equally fallible in their misunderstandings and misrepresentations of proposals.

It is in everyone’s interest to delineate points of genuine agreement and disagreement, with an eye to progress beyond current methodology. Error recognition can especially speed progress if corrective action is not blocked. In this regard, lingering controversies about basic statistics hold important lessons for the incorporation of formal causal-inference methods into general practice. One of them is that, because the complexity of actual context prohibits anything approaching complete modeling, the models actually used are never entirely coherent with that context, and formal analyses can only serve as thought experiments within informal guidelines. In effect, then, neither frequentist nor Bayesian methods alone (at least as presented to date) can suffice for reasonable interpretations of outputs of formal causal modeling. A cautious scientist will thus reserve judgment and treat no methodology as correct or absolute, but will instead examine data from multiple perspectives, taking statistical methods for what they are: semi-automated algorithms which are highly error-prone without extremely sophisticated human input from both methodologists and content-experts.

The practical limits of formal models become especially apparent when attempting to integrate diverse information sources. Neither statistics nor medical science begins to capture the uncertainty attendant on this process, and in fact both encourage pernicious overconfidence by failing to make adequate allowance for unmodeled uncertainty sources. Instead of emphasizing the uncertainties attending field research, statistics and other quantitative methodologies tend to focus on mathematics and often fall prey to the satisfying – and false – sense of logical certainty that mathematics brings to population inferences. Meanwhile, medicine focuses on biochemistry and physiology, and the satisfying – and false – sense of mechanistic certainty about results those bring to individual events.

Bad training and traditions coupled with human desires for clear-cut conclusions have led some impressively credentialed (and often otherwise competent) teams into nonsensical but established statistical malpractices. The common practice of inferring no association, or no effect, or no effect modification because P>0.05 or the confidence interval contains the null shows that even basic statistics are beyond the proper understanding and use of many researchers. The full benefits of more refined causal methods will not be realized if their outputs are abused in this way, and there is a sound basis for fears that new methodologies may worsen overconfidence problems thanks to their more sophisticated appearance.
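The malpractice described above can be made concrete with a minimal sketch. The inputs are hypothetical, invented purely for illustration: a risk-ratio (RR) estimate of 1.5 with a standard error of 0.25 on the log scale, analyzed with a simple normal (Wald) approximation. The function name `wald_summary` is likewise an illustrative assumption, not drawn from any cited work.

```python
import math

def wald_summary(log_rr: float, se: float, z_crit: float = 1.96):
    """Return (two-sided P-value for RR = 1, lower limit, upper limit).

    Uses a normal (Wald) approximation on the log risk-ratio scale;
    the limits are exponentiated back to the RR scale.
    """
    z = log_rr / se
    # Two-sided normal-approximation P-value for the null RR = 1.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    lower = math.exp(log_rr - z_crit * se)
    upper = math.exp(log_rr + z_crit * se)
    return p, lower, upper

# Hypothetical inputs, invented for illustration only.
p, lower, upper = wald_summary(math.log(1.5), 0.25)
print(f"P = {p:.2f}; 95% interval for RR: {lower:.2f} to {upper:.2f}")
```

Under these assumed inputs P exceeds 0.05 and the interval contains the null value RR = 1, yet the same interval also contains RR = 2. Reporting "no association" from such output would therefore discard the fact that the data are about as compatible with a doubling of risk as with no effect.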

In sum, while innovations should be encouraged, special attention should be paid to how erroneous inferences will arise from misconceptions about input assumptions, algorithms, and outputs, and from pressure to overstate inferences. As the significance-testing tragedy shows, misunderstandings and misapplications can easily nullify any benefit of innovation. In particular, treating models as black boxes for synthesizing input information (assumptions, models, prior beliefs, and data) into formal “inferences” (whether *P*-values, interval estimates, or posterior probabilities) encourages dangerous overconfidence.

These hazards show the need to broaden methods education to cover not only newer developments but also the failings of logic and cognition that plague judgments, interpretations, and inferences. For this purpose, case studies of misjudgments and mistakes will be essential, but may mislead if not conducted within the strictest practicable neutrality and disclosure standards, and with awareness of the powerful cognitive biases that can affect analyses, reports, and reviews^{15}.

In these papers, “identification” has its strict statistical meaning of *estimability* rather than its more recent epidemiologic meaning of qualitative identification, as in Schwartz et al. (2016) and VanderWeele (2016).

Holland called this RPOA “Rubin’s causal model,” even though such models had long been in use in experimental analysis (e.g., Neyman 1923; Welch 1937; Wilk 1955). Rubin’s seminal contributions extended the models to statistical analysis of nonexperiments (Rubin 1990).

Apart from invertible equation systems, which do not exist in realistic models for health and social phenomena.

In a common notation, knowing the treatment indicator X is positively associated with the outcome indicator Y_{0} under X=0 yet unassociated with the observed outcome Y_{obs} leads us to infer that Y_{1} < Y_{0} for some observed unit: were Y_{1} ≥ Y_{0} for every unit, Y_{obs} would inherit the positive association of Y_{0} with X.

In terms of potential outcomes Y_{x} indexed by drug dose x, we would infer that the unobserved variable Y_{40} is sometimes greater than Y_{0} even though Pr(X=40) = 0.

In notation: With Y_{x} the potential outcome and Y_{obs} its measurement, we can have Y_{obs} ≠ Y_{x} yet still infer that Y_{1} ≠ Y_{0} for some unit.

Adding confusion, the term ‘consistency’ is already well established for unrelated concepts such as estimator convergence and freedom from contradiction.

Even more startling is that temporality (cause preceding effect) is not necessary in some counterfactual accounts of causation (Price 1996).

In logic this syntactical structure is called a theory, and the interpretations that follow that structure are called models of the theory. I instead call this structure a model, which I think more in line with common usage in statistics and applied sciences.

This fact is one way of seeing why quantum physics has defied classical causal explanations: Robins et al. (2015) show that potential-outcome models obey Bell’s inequality, whose observed violations conflict with local definiteness (local realism, local hidden variables) and local causal diagrams (Gill 2014).

Notably, similar concerns about untestable mathematical theory arise in hard sciences like physics (Ellis and Silk 2014).

When in addition to the causal graph we can assume faithfulness (open paths imply association), the number of logically possible structures is reduced drastically – to the point that a certain limited type of conditional causal identification can be enabled (Spirtes et al. 2001; Robins et al. 2003).

One measure of the evidence against a model (whether causal or not) supplied by the *P*-value *p* from a test of its fit is the binary information or surprisal −log_{2}(*p*).
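As a small illustration of the surprisal transform just described, the following sketch converts a few *P*-values to bits. The function name `surprisal_bits` and the listed *P*-values are assumptions chosen for illustration; only the formula −log_{2}(*p*) comes from the text.

```python
import math

def surprisal_bits(p: float) -> float:
    """Return the surprisal -log2(p), in bits, for a P-value p in (0, 1]."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return -math.log2(p)

# Illustrative P-values only; none are taken from a real analysis.
for p in (0.5, 0.25, 0.05, 0.005):
    print(f"p = {p}: {surprisal_bits(p):.2f} bits")
```

On this scale *p* = 0.05 supplies only about 4.3 bits of information against the tested model, slightly more than the surprise of seeing four heads in four tosses of a fair coin.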

## Acknowledgements

I am deeply indebted to many colleagues for extensive comments and correspondence on the initial draft of this paper, including Alex Broadbent, Jan Vandenbroucke, Neil Pearce, Ashley Naimi, Jay Kaufman, Sharon Schwartz, Nicolle Gatto, Ulka Campbell, George Maldonado, Alfredo Morabia, James Robins, and Tyler VanderWeele. Any errors that remain are solely my responsibility.