Levi’s objection says, to repeat, that there is an ambiguity in premises (2) and (3) of AIR: that the decision referred to in (2) is a decision about what to believe, while the decision referred to in (3) is a decision about how to act, that only the latter presupposes value judgments, and that the scientist qua scientist only needs to decide what to believe. Levi (1962, p. 48) argues that decisions about what to believe precede the acceptance or rejection of hypotheses in a non-behavioral sense, while decisions about how to act precede the acceptance or rejection of hypotheses in a behavioral sense. Thus, while Jeffrey, Douglas, Wilholt, and Schurz all understand (categorical) acceptances of hypotheses as acceptances in a behavioral sense, Levi (1960, p. 349) allows for the possibility of “accepting H in an ‘open-ended’ situation where there is no specific objective”.Footnote 3
Levi (1962, p. 49) understands a scientist who accepts or rejects hypotheses in a non-behavioral sense as one who seeks “the truth and nothing but the truth”: as one who selects the true proposition from a set of competing possible propositions on the basis of the relevant evidence. Levi (1962, p. 51) points out that two constraints operate in the search for truth and nothing but the truth. The first (“hypothesis impartiality”) requires that (a) the scientist not prefer that any proposition from a set of competing propositions be true rather than another. The second (“error impartiality”) requires that (b) she not regard any possible error as more serious than another.Footnote 4 I am going to deal with error impartiality first, and with hypothesis impartiality second.
Remember that Douglas points out that significance level α for the maximum probability of committing a type I error and significance level β for the maximum probability of committing a type II error trade off against each other, and that there seems to be no way to determine α and β without passing a value judgment about the respective consequences of selecting a high α or β. It seems impossible that a researcher can exactly balance these consequences and regard both errors as equally serious. So how does Levi argue in favor of the possibility of error impartiality?
His argument involves the proposal that outcomes that fall outside the critical region should lead to suspension of judgment rather than acceptance of the null hypothesis, where the critical region is the region of test statistic values that result in rejection of the null hypothesis (in the example of Section 3, this is the region where z > 1.645 if α = 0.05 and z > 2.326 if α = 0.01). Under this proposal, the result of rejecting a true null hypothesis is a mistake, while the result of suspending judgment about a false null hypothesis is not. Scientists may accordingly take type I errors more seriously than type II ‘errors’ without violating error impartiality (cf. Levi, 1962, pp. 62–3). Levi points out that significance level α remains a matter of choice on the part of the investigator. But he also believes that α serves as “a rough index of the degree of caution exercised in a search for truth” (Levi, 1962, p. 63). He doesn’t think that this index presupposes value judgments.
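The trade-off can be made concrete with a short numerical sketch (in Python, using scipy); the effect size and sample size are illustrative assumptions rather than values drawn from any of the examples discussed here. Lowering α shrinks the critical region and thereby raises β for a fixed alternative.

```python
# Minimal sketch of the alpha/beta trade-off in a one-sided z-test.
# The effect size and sample size are illustrative assumptions only.
from scipy.stats import norm

effect = 0.5                      # assumed standardized effect under the alternative
n = 25                            # assumed sample size
shift = effect * n ** 0.5         # mean of the test statistic under the alternative

for alpha in (0.05, 0.01):
    z_crit = norm.ppf(1 - alpha)       # critical values: 1.645 and 2.326
    beta = norm.cdf(z_crit - shift)    # probability of a type II error
    print(f"alpha = {alpha:.2f}: reject H0 if z > {z_crit:.3f}, beta = {beta:.3f}")
```

Under Levi’s proposal, a test statistic below the critical value would lead to suspension of judgment rather than acceptance of the null hypothesis, but the arithmetic of the trade-off between α and β is unchanged.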
One might respond that Levi’s proposal only pushes back the problem of selecting adequate significance levels to the stage of selecting the null hypothesis. In Douglas’s example, high α will lead to an over-regulation (under-regulation) of the dioxin-producing parts of the industry if the null hypothesis says that dioxins don’t cause (cause) liver cancer. If it is impossible for the scientist to remain impartial about null hypotheses and their alternatives, it will be impossible for her to remain impartial about errors. So how does Levi argue for the possibility of hypothesis impartiality?
His argument involves a well-known doctrine of classical hypothesis testing that Levi (1962, p. 62) quotes when saying that “the null hypothesis is to be selected in such a way that type I error will be more serious than type II error”. What needs to be understood, however, is that the seriousness in question is in the eye of the truth-seeking scientist. In the eye of that scientist, type I error (of rejecting a true null hypothesis) will be more serious than type II error (of accepting a false null hypothesis) as long as rejecting the null hypothesis amounts to a scientific discovery.
Classical hypothesis testing can be employed to test for all sorts of hypotheses: the hypothesis that a particular consignment of tulip bulbs contains 40 percent of the yellow- and 60 percent of the red-flowering sort, the hypothesis that the prevalence of cardiovascular disease among non-smokers in a given population is 10 percent and so on. In such cases I am unsure whether the truth-seeking scientist can select the null hypothesis in such a way that type I error will be more serious than type II error. But as long as rejecting a hypothesis amounts to a scientific discovery, the scientist needs to select that hypothesis as null hypothesis. The reason is that future research will build on that discovery: if the entity discovered is a causal relation, scientists will investigate the underlying mechanism; if the entity is a new particle, scientists will investigate its properties and so on. If rejecting the null hypothesis amounts to a scientific discovery, type I error will be more serious because future research would be misguided if a true null hypothesis were rejected.
Wilholt (2009, pp. 95–6) claims that Levi’s conception of a decision about what to believe “presupposes a sense of purity of epistemic activity that is exaggerated and unrealistic”. In the present and following section, I will concede that this claim is roughly correct. I will also argue, however, that what contaminates the epistemic activity of deciding what to believe is not necessarily a value judgment. In the rest of the present section, I will deal with the less controversial case of the Higgs boson discovery. In the following section, I will turn to the more controversial case of science with clear non-epistemic impacts.
The Higgs boson discovery is the result of one comprehensive hypothesis test relying on the following statistical model (cf. van Dyk, 2014, p. 55):
$$N_{msc} \sim \mathrm{Poisson}\left[\beta_{sc}(\theta_{sc}, m) + \kappa_{sc}(\phi_{sc}, m)\,\mu\right]$$
I will first explain the variables and parameters and then their subscripts. N is the number of observed events. An event is a proton-proton collision produced by the Large Hadron Collider (LHC) near Geneva. A proton-proton collision results in trajectories of final-state particles. Particle detectors identify these particles by determining their momenta and/or energy. At the LHC, there are seven particle detectors. Two of them, ATLAS and CMS, were involved in the discovery of a Higgs boson. They identified this particle primarily through two decay channels: a Higgs boson decay into two photons (H → γγ) and a Higgs boson decay into two Z bosons (H → ZZ), each of which decays, in turn, into two leptons (either electrons or muons).
The LHC produces millions of events per second, but not all of these are “observed”. The particle detectors have triggers that make very fast decisions as to whether an event is interesting or uninteresting, where an event is uninteresting if it involves well-understood physics. Rather than storing all events, the particle detectors save only those events that the triggers decide are interesting; these events amount to approximately 100 per second. Although this is a small fraction of the total events, it could still result in 10¹⁰ saved events for each experiment over the expected 15-year life span of the LHC. These saved events are the observed events. Their distribution is a Poisson distribution because counting the observed events amounts to a long series of Bernoulli trials (random trials with exactly two possible outcomes: Higgs boson decay and no Higgs boson decay) with very large N and extremely low success probability, and in this limit the binomial distribution of the count converges to a Poisson distribution (see the short numerical sketch following the explanation of μ below).
β(θ, m) models the “expected background count”: the number of events that can be expected to occur if all we have is well-understood physics, i.e. if the null hypothesis is true (the number of particle decays that do not represent Higgs boson decays).
κ(ϕ, m) models the “expected Higgs boson count”: the number of events that can be expected to occur if in addition to well-understood physics, there is new physics, i.e. if the null hypothesis is false (the number of particle decays that represent Higgs boson decays).
θ and ϕ are (vectors of) nuisance parameters (i.e. of parameters not primarily of interest: e.g. variances if the mean is a parameter of interest).
m is the Higgs mass associated with a specific “bin”. A bin is a potential Higgs mass on a fine grid of values of m_H (the unknown Higgs mass). Once a Higgs boson is discovered, its mass can be estimated by including m_H in the model. Estimates indicate that the actual Higgs mass lies in the mass region around 126 GeV (126⋅10⁹ electronvolts).
μ is signal strength: the strength with which a particle decay “signals” its being a Higgs boson decay. Signal strength is a function of the (unknown) Higgs mass: the signal is strongest near the actual Higgs mass. Signal strength is defined so that μ = 0 corresponds to the “background only hypothesis”, i.e. to the hypothesis that the observed particle decays do not represent Higgs boson decays, and μ = 1 to the hypothesis that the observed particle decays represent Higgs boson and other particle decays. This allows μ = 0 to serve as the null hypothesis subjected to significance testing.
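Before turning to the subscripts, it may help to see why a Poisson distribution is appropriate for N. The following sketch (with made-up values for the number of trials and the per-event probability, not LHC figures) compares the exact binomial probabilities with their Poisson approximation.

```python
# Sketch: with a very large number of trials and a very small per-event
# probability, the binomial count is well approximated by a Poisson
# distribution. N and p below are illustrative assumptions, not LHC figures.
from scipy.stats import binom, poisson

N, p = 10**7, 3e-6            # many trials, tiny per-event probability
lam = N * p                   # Poisson rate = expected count (here 30)

for k in (20, 30, 40):
    print(k, binom.pmf(k, N, p), poisson.pmf(k, lam))
# The binomial and Poisson probability masses agree to several decimal places.
```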
I now turn to the subscripts of the variables and parameters. The subscripts indicate that the choice of both the background and Higgs boson models (their parameters and functional forms) is different for each category (stratum) s of (decay) channel c. The choice is different because once the particle detectors have saved the events that their triggers decide are interesting, events are “cut” within each decay channel, and because the events that survive the cuts are “stratified” into relatively homogeneous categories: into categories with homogeneous signal-to-background ratios and invariant mass resolutions.
The cuts aim to focus the analysis on a subset of events wherein new physical particles are more likely to be observed. The fraction of events that survive the cuts and involve new physics can be as low as 10⁻⁸. In the actual Higgs boson discovery there were only a few hundred events in the two primary decay channels that could be associated with a Higgs boson decay. Stratification aims to increase the statistical power for identifying possible excess events above background that are due to new physics.
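To fix ideas, here is a minimal sketch of the kind of binned Poisson likelihood that the model defines, with invented background and signal expectations standing in for β_sc(θ_sc, m) and κ_sc(ϕ_sc, m); nuisance parameters and the profiling performed in the actual ATLAS and CMS analyses are ignored. It computes a simple likelihood-ratio statistic comparing the background-only hypothesis μ = 0 with μ = 1.

```python
# Toy sketch of a binned Poisson likelihood-ratio comparison of mu = 0 and mu = 1.
# The expected background and signal counts per category are invented numbers,
# not the fitted beta_sc and kappa_sc of the actual analyses, and nuisance
# parameters are ignored entirely.
import numpy as np
from scipy.stats import poisson

background = np.array([120.0, 85.0, 40.0])   # assumed expected background counts per category
signal = np.array([10.0, 6.0, 4.0])          # assumed expected Higgs counts per category (mu = 1)
observed = np.array([133, 92, 45])           # assumed observed counts per category

def log_likelihood(mu):
    """Summed Poisson log-likelihood over categories for signal strength mu."""
    return poisson.logpmf(observed, background + mu * signal).sum()

# Twice the log-likelihood ratio of mu = 1 against the background-only hypothesis.
llr = 2 * (log_likelihood(1.0) - log_likelihood(0.0))
print(f"2 * log likelihood ratio (mu = 1 vs mu = 0): {llr:.2f}")
```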
It is not entirely clear whether particle physicists can be said to accept the hypothesis that there are Higgs bosons. At least in official statements, they prefer to say that there is “conclusive evidence for the discovery of a new particle”, and that this evidence “is compatible with the hypothesis that the new particle is the Standard Model Higgs boson” (ATLAS collaborators, 2012, p. 15). But apart from these statements, which might simply express an abundance of caution, the Higgs boson discovery seems to exemplify a decision about what to believe rather well.
Particle physicists selected μ = 0 as the null hypothesis because they believed that type I error (of rejecting a true hypothesis) would be more serious than type II error (of accepting a false hypothesis) if the null hypothesis says that μ = 0. They believed that type I error would be more serious because future research would be misguided if μ = 0 were rejected despite being true. At the same time, they selected a high “degree of caution” to minimize the risk of rejecting a true null hypothesis: 6σ and 5σ in the ATLAS and CMS experiments, respectively, where a level of 5σ corresponds to a one-sided probability of about 1 in 3.5 million.
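The correspondence between σ levels and tail probabilities can be checked directly; the sketch below uses the one-sided convention that particle physicists adopt when reporting discovery significances.

```python
# Converting a significance expressed in "sigmas" into a one-sided p-value.
from scipy.stats import norm

for sigma in (3, 5, 6):
    p = norm.sf(sigma)            # upper-tail probability of a standard normal
    print(f"{sigma} sigma: p = {p:.2e} (about 1 in {1 / p:,.0f})")
# 5 sigma gives p of roughly 2.9e-7, i.e. about 1 in 3.5 million.
```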
There have been several reasons for selecting a significance level of 5σ or higher. A purely conventional reason is that 5σ is the significance level that editors of particle physics journals generally require to claim a detection. Dawid (2015, pp. 79–80) cites a purely epistemic reason for selecting 5σ: the attempt to keep in check the so-called look-elsewhere effect, i.e. the problem that the probability of μ = 0 that can be calculated for any category s of channel c (the “global” p-value) is greater than the minimum probability of μ = 0 that can be calculated for a specific category s of channel c (the “local” p-value). Van Dyk (2014, p. 54) mentions a pragmatic reason for selecting 5σ: the attempt to account for model misspecifications that, given the large number of models for each category s of channel c, seem quite likely. Staley (2017b, p. 357) mentions another pragmatic reason: a “consideration of both the negative consequences of an erroneous discovery claim and the value for the further pursuit of inquiry of a correct discovery claim.”
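The look-elsewhere effect can be conveyed by a small calculation. If the smallest local p-value is picked out of many places where an excess could have appeared, the probability of such an excess appearing somewhere (the global p-value) is much larger than that local p-value. The number of search bins below is an arbitrary assumption, and treating the bins as independent is a simplification of the actual analysis.

```python
# Sketch of the look-elsewhere effect: the smallest local p-value among many
# search bins overstates the evidence; the global p-value corrects for the
# number of places an excess could have appeared.
# The number of bins is an arbitrary assumption; bins are treated as independent.
n_bins = 100
p_local = 1e-4                               # smallest local p-value found in the scan

# Probability of at least one excess this extreme somewhere, under the null.
p_global = 1 - (1 - p_local) ** n_bins
print(f"local p = {p_local:.1e}, global p = {p_global:.4f}")   # roughly 0.01
```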
It is true that only one of these reasons is purely epistemic. One may accordingly think that even in the case of the Higgs boson discovery, a decision about what to believe “presupposes a sense of purity of epistemic activity that is exaggerated and unrealistic”. But what contaminates epistemic activity in the case of the Higgs boson discovery is not a value judgment, but a number of conventional or pragmatic reasons. A good way to think about the difference is in terms of conditionals, the consequents of which make reference to methodological decisions (select μ = 0 as null hypothesis, select 5σ or higher as significance level and so on). It is only in the case of value judgments that the antecedents refer to valuations of the utility of specific individuals or groups. In the case of conventional or pragmatic reasons, the antecedents make reference to technical goals (that of keeping the look-elsewhere effect in check, that of accounting for model misspecifications and so on).
One may of course object that technical goals are made explicit in official communiqués, while it is in fact utility valuations that determine the methodological decisions. One may argue with Staley (2017b, p. 369), for instance, that a false discovery claim would have been “tremendously embarrassing” for the physicists involved in the Higgs search, and that the 5σ significance level was selected to avoid that potential embarrassment.Footnote 5 But the objection is not only “speculative” (as Staley himself admits); it also ignores that the wider community of scientists would not accept methodological decisions if they were taken to increase the utility of specific individuals or groups. It ignores, for instance, that the majority of physicists not involved in the Higgs search would not accept the selection of 5σ as significance level if that selection did not allow for the achievement of specific technical goals.Footnote 6
Staley (2017a, b, p. 368) states that Levi’s view about significance level α seems close to the view expressed by Douglas, and that his (Staley’s) point that 5σ has been selected for a pragmatic reason resembles AIR. This statement seems to suggest that a decision about what to believe is impossible without value judgments even in the case of the Higgs boson discovery. It is important to understand, however, that this impossibility does not follow from Staley’s analysis of the Higgs boson discovery. Staley (2017b, p. 357) concedes that “the negative consequences of an erroneous discovery claim and the value for the further pursuit of inquiry of a correct discovery claim” can be regarded as epistemic values in a sense proposed by Steel. Staley prefers to think of these values as pragmatic, but what is clear is that they do not qualify as non-epistemic in the sense of value judgments. Thus even under Staley’s analysis, the Higgs boson discovery exemplifies a decision about what to believe on the part of physicists: a decision that is epistemically pure in the sense of not depending on value judgments, but not epistemically pure in the sense of not depending on pragmatic considerations.