The sections above deal with the most basic misconceptions regarding the nature of P-values, but critics of P-values usually focus on other important issues. In this section I will deal with the significance filter, multiple comparisons and some forms of P-hacking, and I need to point out immediately that most of the issues are not specific to P-values even if some of them are enabled by the unfortunate dichotomisation of P-values into significant and not significant. In other words, the practical problems with P-values are largely the practical problems associated with the misuse of P-values and with sloppy statistical inference generally.
3.1 The Significance Filter Exaggeration Machine
It is natural to assume that the effect size observed in an experiment is a good estimate of the true effect size, and in general that can be true. However, there are common circumstances where the observed effect size consistently overestimates the true effect size, sometimes wildly so. The overestimation arises because experimental results that exaggerate the true effect are more likely to be found statistically significant, and because we pay more attention to the significant results and are more likely to report them. The key to the effect is selective attention to a subset of results – the significant results – and so the process is appropriately called the significance filter.
If there is nothing untoward in the sampling mechanism,Footnote 11 sample means are unbiassed estimators of population means and sample-based standard deviations are nearly unbiassed estimators of population standard deviations.Footnote 12 Because of that we can assume that, on average, a sample mean provides a sensible ‘guesstimate’ for the population parameter and, to a lesser degree, so does the observed standard deviation. That is indeed the case for averages over all samples, but it cannot be relied upon for any particular sample. If attention has been drawn to a sample on the basis that it is ‘statistically significant’, then that sample is likely to offer an exaggerated picture of the true effect. The phenomenon is usually called the significance filter. The way it works is fairly easily described but, as usual, there are some complexities in its interpretation.
Say we are in a position to run an experiment 100 times with random samples of n = 5 from a single normally distributed population with mean μ = 1 and standard deviation σ = 1. We would expect that, on average, the sample means, \( \overline{x} \), would be scattered symmetrically around the true value of 1, and the sample-based standard deviations, s, would be scattered around the true value of 1, albeit slightly asymmetrically. A set of 100 simulations matching that scenario shows exactly that result (see the left panel of Fig. 6), with the median of \( \overline{x} \) being 0.97 and the median of s being 0.94, both of which are close to the expected values of exactly 1 and about 0.92, respectively. If we were to pay attention only to the results where the observed P-value was less than 0.05 (with the null hypothesis being that the population mean is 0), then we get a different picture because the values are very biassed (see the right panel of Fig. 6). Among the ‘significant’ results the median sample mean is 1.2 and the median standard deviation is 0.78.
The systematic bias of mean and standard deviation among ‘significant’ results in those simulations might not seem too bad, but it is conventional to scale the effect size as the standardised ratio \( \overline{x} / s \),Footnote 13 and the median of that ratio among the ‘significant’ results is fully 50% larger than the correct value. What’s more, the biasses get worse with smaller samples, with smaller true effect sizes, and with lower P-value thresholds for ‘significance’.
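The filter is easy to demonstrate in a few lines of code. The following sketch (Python with NumPy and SciPy is assumed; the seed and variable names are illustrative only) repeats the scenario just described and compares the medians before and after filtering on P < 0.05:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)          # illustrative seed
mu, sigma, n, n_sim = 1.0, 1.0, 5, 100

means, sds, pvals = [], [], []
for _ in range(n_sim):
    x = rng.normal(mu, sigma, n)
    p = stats.ttest_1samp(x, 0.0).pvalue   # null hypothesis: population mean is 0
    means.append(x.mean())
    sds.append(x.std(ddof=1))
    pvals.append(p)

means, sds, pvals = map(np.array, (means, sds, pvals))
sig = pvals < 0.05                          # the significance filter

print("all results:      median mean %.2f, median SD %.2f, median mean/SD %.2f"
      % (np.median(means), np.median(sds), np.median(means / sds)))
print("significant only: median mean %.2f, median SD %.2f, median mean/SD %.2f"
      % (np.median(means[sig]), np.median(sds[sig]), np.median(means[sig] / sds[sig])))

Varying n, mu and the 0.05 threshold in that sketch shows how the exaggeration worsens with smaller samples, smaller true effects, and more stringent thresholds.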
It is notable that even the result with the most extreme exaggeration of effect size in Fig. 6 – 550% – would not be counted as an error within the Neyman–Pearsonian hypothesis testing framework! It would not lead to the false rejection of a true null or to an inappropriate failure to reject a false null and so it is neither a type I nor a type II error. But it is some type of error, a substantial error in estimation of the magnitude of the effect. The term type M error has been devised for exactly that kind of error (Gelman and Carlin 2014). A type M error might be underestimation as well as overestimation, but overestimation is more common in theory (Lu et al. 2018) and in practice (Camerer et al. 2018).
The effect size exaggeration coming from the significance filter is not a result of sampling, or of significance testing, or of P-values. It is a result of paying extra attention to a subset of all results – the ‘significant’ subset.
The significance filter presents a peculiar difficulty. It leads to exaggeration on average, but any particular result may well be close to the correct size whether it is ‘significant’ or not. A real-world sample mean of, say, \( \overline{x}=1.5 \) might be an exaggeration of μ = 1, it might be an underestimation of μ = 2, or it might be pretty close to μ = 1.4, and there would be no way to be certain without knowing μ, and if μ were known then the experiment would probably not have been necessary in the first place. That means that the possibility of a type M error looms over any experimental result that is interesting because of a small P-value, and that is particularly true when the sample size is small. The only way to gain more confidence that a particular significant result closely approximates the true state of the world is to repeat the experiment – the second experiment would not have been run through the significance filter and so its results would not carry a greater than average risk of exaggeration, and the overall inference can be informed by both results. Of course, experiments intended to repeat or replicate an interesting finding should take the possible exaggeration into account by being designed to have higher power than the original.
3.2 Multiple Comparisons
Multiple testing is the situation where the tension between global and local considerations is most stark. It is also the situation where the well-known jelly beans cartoon from XKCD.com is irresistible (Fig. 7). The cartoon scenario is that jelly beans were suspected of causing acne, but a test found “no link between jelly beans and acne (P > 0.05)”, and so the possibility that only a certain colour of jelly bean causes acne is then entertained. All 20 colours of jelly bean are independently tested, with only the result from green jelly beans being significant, “(P < 0.05)”. The newspaper headline at the end of the cartoon mentions only the green jelly beans result, and it does that with exaggerated certainty. The usual interpretation of that cartoon is that the significant result with green jelly beans is likely to be a false positive because, after all, hypothesis testing with the threshold of P < 0.05 is expected to yield a false positive one time in 20, on average, when the null is true.
The more hypothesis tests there are, the higher the risk that one of them will yield a false positive result. The textbook response to multiple comparisons is to introduce ‘corrections’ that protect against inflation of the family-wise false positive error rate by adjusting the significance threshold according to the number of tests in the family. The Bonferroni adjustment is the best-known method, and while there are several alternative ‘corrections’ that perform a little better, none of them is nearly as simple. A Bonferroni adjustment for the family of experiments in the cartoon would preserve an overall false positive error rate of no more than 5% by setting a threshold for significance of 0.05∕20 = 0.0025 in each of the 20 hypothesis tests.Footnote 14 It must be noted that such protection does not come for free, because adjustments for multiplicity invariably strip statistical power from the analysis.
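The arithmetic behind that protection can be made concrete with a small sketch (Python is assumed; the independence assumption in the calculation is a simplification for illustration, since the Bonferroni adjustment itself does not require it):

# Family-wise false positive error rate for a family of k tests of true null hypotheses,
# assuming (for simplicity) that the tests are independent.
alpha, k = 0.05, 20

per_test_threshold = alpha / k                       # Bonferroni: 0.05/20 = 0.0025
unadjusted_fwer = 1 - (1 - alpha) ** k               # chance of at least one false positive at 0.05
bonferroni_fwer = 1 - (1 - per_test_threshold) ** k  # chance of at least one false positive at 0.0025

print(per_test_threshold)          # 0.0025
print(round(unadjusted_fwer, 3))   # about 0.64
print(round(bonferroni_fwer, 3))   # just under 0.05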
We do not know whether the ‘significant’ link between green jelly beans and acne would survive a Bonferroni adjustment because the actual P-values were not supplied,Footnote 15 but as an example, a P-value of 0.003, low enough to be quite encouraging as the result of a significance test, would be ‘not significant’ according to the Bonferroni adjustment. Such a result would present us with a serious dilemma because the inference supported by the local evidence would be apparently contradicted by global error rate considerations. However, that contradiction is not what it seems because the null hypothesis of the significance test P-value is a different null hypothesis from that tested by the Bonferroni-adjusted hypothesis test. The significance test null concerns only the green jelly beans whereas the null hypothesis of the Bonferroni adjustment is an omnibus null hypothesis that says that the link between green jelly beans and acne is zero and the link between purple jelly beans and acne is zero and the link between brown jelly beans and acne is zero, and so on. The P-value null hypothesis is local and the omnibus null is global. The global null hypothesis might be appropriate before the evidence is available (i.e. for power calculations and experimental planning), but after the data are in hand the local null hypothesis concerning just the green jelly beans gains importance.
It is important to avoid being blinded to the local evidence by a non-significant global test. After all, the pattern of evidence in the cartoon is exactly what would be expected if the green colouring agent caused acne: green jelly beans are associated with acne but the other colours are not. (The failure to see an effect of the mixed jelly beans in the first test is easily explicable on the basis of the lower dose of green.) If the data from the trial of green jelly beans are independent of the data from the trials of other colours, then there is no way that the existence of those other data – or their analysis – can influence the nature of the green data. The green jelly bean data cannot logically have been affected by the fact that mauve and beige jelly beans were tested at a later point in time – the subsequent cannot affect the previous – and the experimental system would have to be bizarrely flawed for the testing of the purple or brown jelly beans to affect the subsequent experiment with green jelly beans. If the multiplicity of tests did not affect the data, then it is only reasonable to say that it did not affect the evidence.
The omnibus global result does not cancel the local evidence, or even alter it, and yet the elevated risk of a false positive error is real. That presents us with a dilemma and, unfortunately, statistics does not provide a way around it. Global error rates and local evidence operate in different logical spaces (Thompson 2007) and so there can be no strictly statistical way to weigh them together. All is not lost, though, because statistical limitations do not preclude thoughtful integration of local and global issues when making inferences. We just have to be more than normally cautious when the local and global pull in different directions. For example, in the case of the cartoon, the evidence in the data favours the idea that green jelly beans are linked with acne (and if we had an exact P-value then we could specify the strength of favouring) but because the data were obtained by a method with a substantial false positive error rate we should be somewhat reluctant to take that evidence at face value. It would be up to the scientist in the cartoon (the one with safety glasses) to form a provisional scientific conclusion regarding the effect of green jelly beans, even if that inference is that any decision should be deferred until more evidence is available. Whatever the inference, the evidence, the theory, the method, and any other corroborating or rebutting information should all be considered and reported.
A man or woman who sits and deals out a deck of cards repeatedly will eventually get a very unusual set of hands. A report of unusualness would be taken differently if we knew it was the only deal made, or one of a thousand deals, or one of a million deals, etc. – Tukey (1991, p. 133)
In isolation the cartoon experiments are probably only sufficient to suggest that the association between green jelly beans and acne is worthy of further investigation (with the earnestness of that suggestion being inversely related to the size of the relevant P-value). The only way to be in a position to report an inference concerning those jelly beans without having to hedge around the family-wise false positive error rate and the significance filter is to re-test the green jelly beans. New data from a separate experiment will be free from the taint of elevated family-wise error rates and untouched by the significance filter exaggeration machine. And, of course, all of the original experiments should be reported alongside the new, as well as reasoned argument incorporating corroborating or rebutting information and theory.
The fact that a fresh experiment is necessary to allow a straightforward conclusion about the effect of the green jelly beans means that the experimental series shown in the cartoon is a preliminary, exploratory study. Preliminary or exploratory research is essential to scientific progress and can merit publication as long as it is reported completely and openly as preliminary. Too often scientists fall into the pattern of misrepresenting the processes that lead to their experimental results, perhaps under the mistaken assumption that science has to be hypothesis driven (Medawar 1963; du Prel et al. 2009; Howitt and Wilson 2014). That misrepresentation may take the form of a suggestion, implied or stated, that the green jelly beans were the intended subject of the study – a behaviour described as HARKing, for hypothesising after the results are known – or of cherry picking, where only the significant results are presented. The reason that HARKing is problematical is that hypotheses cannot be tested using the data that suggested the hypothesis in the first place because those data always support that hypothesis (otherwise they would not be suggesting it!), and cherry picking introduces a false impression of the nature of the total evidence and allows the direct introduction of experimenter bias. Either way, focussing on just the unusual observations from a multitude is bad science. It takes little effort and few words to say that 20 colours were tested and only the green yielded a statistically significant effect, and a scientist can (should) then hypothesise that green jelly beans cause acne and test that hypothesis with new data.
3.3 P-hacking
P-hacking is where an experiment or its analysis is directed at obtaining a small enough P-value to claim significance instead of being directed at the clarification of a scientific issue or testing of a hypothesis. Deliberate P-hacking does happen, perhaps driven by the incentives built into the systems of academic reward and publication imperatives, but most P-hacking is accidental – honest researchers doing ‘the wrong thing’ through ignorance. P-hacking is not always as wrong as might be assumed, because the idea of P-hacking comes from paying attention exclusively to global considerations of error rates, and most particularly to false positive error rates. Those most stridently opposed to P-hacking will point to the increased risk of false positive errors, but rarely to the lowered risk of false negative errors. I will recklessly note that some categories of P-hacking look entirely unproblematical when viewed through the prism of local evidence. The local versus global distinction allows a more nuanced response to P-hacking.
Some P-hacking is outright fraud. Consider this example that has recently come to light:
One sticking point is that although the stickers increase apple selection by 71%, for some reason this is a p value of .06. It seems to me it should be lower. Do you want to take a look at it and see what you think. If you can get the data, and it needs some tweeking, it would be good to get that one value below .05.
– Email from Brian Wansink to David Just on Jan. 7, 2012. – Lee (2018)
I do not expect that any readers would find P-hacking of that kind to be acceptable. However, the line between fraudulent P-hacking and the more innocent P-hacking through ignorance is hard to define, particularly so given the fact that some behaviours derided as P-hacking can be perfectly legitimate as part of a scientific research program. Consider this cherry picked listFootnote 16 of responses to a P-value being greater than 0.05 that have been described as P-hacking (Motulsky 2014):
- Analyse only a subset of the data;
- Remove suspicious outliers;
- Adjust data (e.g. divide by body weight);
- Transform the data (i.e. logarithms);
- Repeat to increase sample size (n).
Before going any further I need to point out that Motulsky has a more realistic attitude to P-hacking than might be assumed from my treatment of his list. He writes: “If you use any form of P-hacking, label the conclusions as ‘preliminary’.” (Motulsky 2014, p. 1019).
Analysis of only a subset of the data is illicit if the unanalysed portion is omitted in order to manipulate the P-value, but unproblematical if it is omitted for being irrelevant to the scientific question at hand. Removal of suspicious outliers is similar in being only sometimes inappropriate: it depends on what is meant by the term “outlier”. If it indicates that a datum is a mistake such as a typographical or transcriptional error, then of course it should be removed (or corrected). If an outlier is the result of a technical failure of a particular run of the experiment, then perhaps it should be removed, but the technical success or failure of an experimental run must not be judged by the influence of its data on the overall P-value. If the word outlier just denotes a datum that is further from the mean than the others in the dataset, then omit it at your peril! Omission of that type of outlier will reduce the variability in the data and give a lower P-value, but will markedly increase the risk of false positive results and it is, indeed, an illicit and damaging form of P-hacking.
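The danger of that last kind of omission can be illustrated with a simulation sketch (Python assumed; the rule of dropping the single most extreme datum whenever the first P-value is not significant is a deliberately crude stand-in for P-value-driven outlier removal):

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, n_sim, alpha = 10, 20000, 0.05
plain = hacked = 0

for _ in range(n_sim):
    x = rng.normal(0.0, 1.0, n)                  # the null hypothesis (mean = 0) is true
    p = stats.ttest_1samp(x, 0.0).pvalue
    plain += p < alpha
    if p >= alpha:                               # 'not significant', so hunt for an 'outlier'
        trimmed = np.delete(x, np.argmax(np.abs(x - x.mean())))
        p = stats.ttest_1samp(trimmed, 0.0).pvalue
    hacked += p < alpha

print("false positive rate, data left alone:    ", plain / n_sim)   # close to the nominal 0.05
print("false positive rate, after outlier hunting:", hacked / n_sim)  # inflated above 0.05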
Adjusting the data by standardisation is appropriate – desirable even – in some circumstances. For example, if a study concerns feeding or organ masses, then standardising to body weight is probably a good idea. Such manipulation of data should be considered P-hacking only if an analyst finds a too large P-value in unstandardised data and then tries out various re-expressions of the data in search of a low P-value, and then reports the results as if that expression of the data was intended all along. The P-hackingness of log-transformation is similarly situationally dependent. Consider pharmacological EC50s or drug affinities: they are strictly bounded at zero and so their distributions are skewed. In fact the distributions are quite close to log-normal and so log-transformation before statistical analysis is appropriate and desirable. Log-transformation of EC50s gives more power to parametric tests and so it is common that significance testing of logEC50s gives lower P-values than significance testing of the un-transformed EC50s. An experienced analyst will choose the log-transformation because it is known from empirical and theoretical considerations that the transformation makes the data better match the expectations of a parametric statistical analysis. It might sensibly be categorised as P-hacking only if the log-transformation was selected with no justification other than it giving a low P-value.
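The EC50 case can likewise be made concrete with a sketch (Python assumed; the log-normal parameters and the two-fold potency difference are arbitrary illustrative choices, not pharmacological constants):

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, n_sim, alpha = 6, 5000, 0.05

# Two groups of EC50s differing two-fold in potency, with log-normal scatter.
log_mu_a, log_mu_b, log_sd = np.log(100.0), np.log(200.0), 0.6

wins_raw = wins_log = 0
for _ in range(n_sim):
    ec50_a = rng.lognormal(log_mu_a, log_sd, n)
    ec50_b = rng.lognormal(log_mu_b, log_sd, n)
    wins_raw += stats.ttest_ind(ec50_a, ec50_b).pvalue < alpha
    wins_log += stats.ttest_ind(np.log10(ec50_a), np.log10(ec50_b)).pvalue < alpha

print("power of t-test on raw EC50s:   ", wins_raw / n_sim)
print("power of t-test on log10(EC50)s:", wins_log / n_sim)

The log-transformation is chosen here before any P-value is inspected, on the grounds that the data-generating process is close to log-normal, which is exactly the kind of justification that separates sound practice from P-hacking.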
The last form of P-hacking in the list requires a good deal more consideration than the others because, well, statistics is complicated. That consideration is facilitated by a concrete scenario – a scenario that might seem surprisingly realistic to some readers. Say you run an experiment with n = 5 observations in each of two independent groups, one treated and one control, and obtain a P-value of 0.07 from Student’s t-test. You might stop and integrate the very weak evidence against the null hypothesis into your inferential considerations, but you decide that more data will clarify the situation. Therefore you run some extra replicates of the experiment to obtain a total of n = 10 observations in each group (including the initial 5), and find that the P-value for the data in aggregate is 0.002. The risk of the ‘significant’ result being a false positive error is elevated because the data have had two chances to lead you to discard the null hypothesis. Conventional wisdom says that you have P-hacked. However, there is more to be considered before the experiment is discarded.
Conventional wisdom usually takes the global perspective. As mentioned above, it typically privileges false positive errors over any other consideration, and calls the procedure invalid. However, the extra data have added power to the experiment and lowered the expected P-value for any true effect size. From a local evidence point of view, increasing the sample size increases the amount of evidence available for use in inference, which is a good thing. Is extending an experiment after the statistical analysis a good thing or a bad thing? The conventional answer is that it is a bad thing and so the conventional advice is don't do it! However, a better response might balance the bad effects of extending the experiment against the good. Consideration of the local and global aspects of statistical inference allows a much more nuanced answer. The procedure described would be perfectly acceptable for a preliminary experiment.
Technically the two-stage procedure in that scenario allows optional stopping. The scenario is not explicit, but it can be discerned that the stopping rule was, in effect: run n = 5 and inspect the P-value; if it is small enough, then stop and make inferences about the null hypothesis; if the P-value is not small enough for a stop but nonetheless small enough to represent some evidence against the null hypothesis, add an extra 5 observations to each group to give n = 10, stop, and analyse again. We do not know how low the interim P-value would have to be for the protocol to stop, and we do not know how high it could be for the extra data still to be gathered, but no matter where those thresholds are set, such stopping rules yield false positive rates higher than the nominal critical value for stopping would suggest. Because of that, the conventional view (the global perspective, of course) is that the protocol is invalid, but it would be more accurate to say that such a protocol would be invalid unless the P-value or the threshold for a Neyman–Pearsonian dichotomous decision is adjusted as would be done with a formal sequential test. It is interesting to note that the elevation of false positive rate is not necessarily large. Simulations of the scenario as specified and with P < 0.1 as the threshold for continuing show that the overall false positive error rate would be about 0.008 when the critical value for stopping at the first stage is 0.005, and about 0.06 when that critical value is 0.05.
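Figures of that kind can be checked with a simulation sketch along the following lines (Python assumed; applying the same critical value at both stages is my reading of the scenario, and readers can vary the thresholds and sample sizes):

import numpy as np
from scipy import stats

def optional_stopping_fpr(stop_p, continue_p=0.10, n1=5, n_extra=5,
                          n_sim=20000, seed=4):
    """False positive rate of the two-stage protocol when the null hypothesis is true."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sim):
        a = rng.normal(0.0, 1.0, n1)            # both groups drawn from the same population,
        b = rng.normal(0.0, 1.0, n1)            # so any 'significant' result is a false positive
        p1 = stats.ttest_ind(a, b).pvalue
        if p1 < stop_p:                         # stop early and reject the null
            hits += 1
        elif p1 < continue_p:                   # weak evidence: add 5 more per group and re-test
            a = np.concatenate([a, rng.normal(0.0, 1.0, n_extra)])
            b = np.concatenate([b, rng.normal(0.0, 1.0, n_extra)])
            hits += stats.ttest_ind(a, b).pvalue < stop_p
    return hits / n_sim

print(optional_stopping_fpr(stop_p=0.05))    # compare with the nominal 0.05
print(optional_stopping_fpr(stop_p=0.005))   # compare with the nominal 0.005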
The increased rate of false positives (a global error rate) is real, but that does not mean that the evidential meaning of the final P-value of 0.002 is changed. It is the same local evidence against the null as if it had been obtained from a simpler one-stage protocol with n = 10. After all, the data are exactly the same as if the experimenter had intended to obtain n = 10 from the beginning. The optional stopping has changed the global properties of the statistical procedure but not the local evidence, which is contained in the actualised data.
You might be wondering how it is possible that the local evidence be unaffected by a process that increases the global false positive error rate. The rationale is that the evidence is contained within the data but the error rate is a property of the procedure – evidence is local and error rates are global. Recall that false positive errors can only occur when the null hypothesis is true. If the null is true, then the procedure has increased the risk of the data leading us to a false positive decision, but if the null is false, then the procedure has decreased the risk of a false negative decision. Which of those has paid out in this case cannot be known because we do not know the truth of this local null hypothesis. It might be argued that an increase in the global risk of false positive decisions should outweigh the decreased risk of false negatives, but that is a value judgement that ought to take into account particulars of the experiment in question, the role of that experiment in the overall study, and other contextual factors that are unspecified in the scenario and that vary from circumstance to circumstance.
So, what can be said about the result of that scenario? The result of P = 0.002 provides moderately strong evidence against the null hypothesis, but it was obtained from a procedure with sub-optimal false positive error characteristics. That sub-optimality should be accounted for in the inferences that are made from the evidence, but it is only confusing to say that it alters the evidence itself, because it is the data that contain the evidence and the sub-optimality did not change the data. Motulsky provides good advice on what to do when your experiment has involved optional stopping:
- For each figure or table, clearly state whether or not the sample size was chosen in advance, and whether every step used to process and analyze the data was planned as part of the experimental protocol.
- If you used any form of P-hacking, label the conclusions as “preliminary.”
Given that basic pharmacological experiments are often relatively inexpensive and quickly completed, one can add to that list the option of corroborating (or not) those results with a fresh experiment designed to have a larger sample size (remember the significance filter exaggeration machine) and performed according to that design. Once we move beyond the globalist mindset of one-and-done, such an option will seem obvious.
3.4 What Is a Statistical Model?
I remind the reader that this chapter is written under the assumption that pharmacologists can be trusted to deal with the full complexity of statistics. That assumption gives me licence to discuss unfamiliar notions like the role of the statistical model in statistical analysis. All too often the statistical model is invisible to ordinary users of statistics, and that invisibility encourages thoughtless use of flawed and inappropriate models, thereby contributing to the misuse of inferential statistics like P-values.
A statistical model is what allows the formation of calibrated statistical inferences and non-trivial probabilistic statements in response to data. The model does that by assigning probabilities to potential arrangements of data. A statistical model can be thought of as a set of assumptions, although it might be more realistic to say that a chosen statistical model imposes a set of assumptions onto the experimenter.
I have often been struck by the extent to which most textbooks, on the flimsiest of evidence, will dismiss the substitution of assumptions for real knowledge as unimportant if it happens to be mathematically convenient to do so. Very few books seem to be frank about, or perhaps even aware of, how little the experimenter actually knows about the distribution of errors in his observations, and about facts that are assumed to be known for the purposes of statistical calculations.
– Colquhoun (1971, p. v)
Statistical models can take a variety of forms (McCullagh 2002), but the model for the familiar Student’s t-test for independent samples is reasonably representative. That model consists of assumed distributions (normal) of two populations with parameters mean (\( \mu_1 \) and \( \mu_2 \)) and standard deviation (\( \sigma_1 \) and \( \sigma_2 \)),Footnote 17 and a rule for obtaining samples (e.g. a randomly selected sample of n = 6 observations from each population). A specified value of the difference between means serves as the null hypothesis, so \( {H}_0:{\mu}_1-{\mu}_2={\delta}_{H_0} \). The test statistic isFootnote 18
$$ t=\frac{\left({\overline{x}}_1-{\overline{x}}_2\right)-{\delta}_{H_0}}{s_p\sqrt{1/{n}_1+1/{n}_2}} $$
where \( \overline{x} \) is a sample mean and \( s_p \) is the pooled standard deviation. The explicit inclusion of a null hypothesis term in the equation for t is relatively rare, but it is useful because it shows that the null hypothesis is just a possible value of the difference between means. Most commonly the null hypothesis says that the difference between means is zero – it can be called a ‘nil-null’ – and in that case the omission of \( {\delta}_{H_0} \) from the equation makes no numerical difference.
Values of t calculated by that equation have a known distribution when \( {\mu}_1-{\mu}_2={\delta}_{H_0} \), and that distribution is Student’s t-distribution.Footnote 19 Because the distribution is known it is possible to define acceptance regions for a hypothesis test at any level of α, and any observed t-value can be converted into a P-value in a significance test.
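That correspondence between the t statistic and the P-value can be made explicit in a few lines of code (Python assumed; the data values are invented for the illustration):

import numpy as np
from scipy import stats

x1 = np.array([3.2, 4.1, 2.8, 3.9, 4.4, 3.5])    # invented sample from population 1 (n = 6)
x2 = np.array([2.1, 2.9, 3.0, 2.4, 1.8, 2.6])    # invented sample from population 2 (n = 6)
delta_h0 = 0.0                                   # the nil-null: no difference between means

n1, n2 = len(x1), len(x2)
sp = np.sqrt(((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2))
t = ((x1.mean() - x2.mean()) - delta_h0) / (sp * np.sqrt(1 / n1 + 1 / n2))

df = n1 + n2 - 2
p = 2 * stats.t.sf(abs(t), df)                   # two-sided P-value from Student's t-distribution

print(t, df, p)
print(stats.ttest_ind(x1, x2))                   # agrees with the hand calculation for the nil-null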
An important problem that a pharmacologist is likely to face when using a statistical model is that it is just a model. Scientific inferences are usually intended to communicate something about the real world, not the mini world of a statistical model, and the connection between a model-based probability of obtaining a test statistic value and the state of the real world is always indirect and often inscrutable. Consider the meaning conveyed by an observed P-value of 0.002. It indicates that the data are strange or unusual compared to the expectations of the statistical model when the parameter of interest is set to the value specified by the null hypothesis. The statistical model expects a P-value as small as 0.002 to occur only two times out of a thousand, on average, when the null is true. If such a P-value is observed, then one of these situations has arisen:
- a two in a thousand accident of random sampling has occurred;
- the null hypothesised parameter value is not close to the true value;
- the statistical model is flawed or inapplicable because one or more of the assumptions underlying its application are erroneous.
Typically only the first and second are considered, but the last is every bit as important because when the statistical model is flawed or inapplicable then the expectations of the model are not relevant to the real-world system that spawned the data. Figure 8 shows the issue diagrammatically. When we use a model-based statistical inference to inform inferences about the real world we are implicitly assuming: (1) that the real-world system that generated the data is an analogue of the population in the statistical model; (2) that the way the data were obtained is well described by the sampling rule of the statistical model; and (3) that the observed data are analogous to the random sample assumed in the statistical model. To the degree that those assumptions are erroneous there is degradation of the relevance of the model-based statistical inference to the real-world inference that is desired.
Considerations of model applicability are often limited to the population distribution (is my data normal enough to use a Student’s t-test?) but it is much more important to consider whether there is a definable population that is relevant to the inferential objectives and whether the experimental units (“subjects”) approximate a random sample. Cell culture experiments are notorious for having ill-defined populations, and while experiments with animal tissues may have a definable population, the animals are typically delivered from an animal breeding or holding facility and are unlikely to be a random sample. Issues like those mean that the nominally calibrated uncertainty offered by statistical methods might be more or less miscalibrated in practice. For good inferential performance in the real world, there has to be a flexible and well-considered linking of model-based statistical inferences and scientific inferences concerning the real world.