The p value
The p value is defined as the probability of what was observed, or of an even more extreme observation, assuming that the null hypothesis \(H_0\) is true.Footnote 6 To illustrate, suppose we are interested in the true effect size R (where R might be, e.g., a correlation coefficient). We take as our null hypothesis \(H_0: R = 0\) (no correlation). We draw a random sample of individuals, on the basis of which we calculate the observed correlation r in this particular sample. Let’s say we obtain \(r = 0.5\). Then by definition the (two-sided) p value \(= \Pr[\,|r| \ge 0.5 \mid R = 0\,]\) (Fig. 1). The calculation of this probability requires a model, or distribution; the Figure illustrates a case in which the distribution of r under \(H_0\) is normal. This model determines the shape of the curve with respect to which the “tail” probability (that is, the area in the shaded “tails” of the distribution) is computed.
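To make the computation concrete, here is a minimal sketch in Python (the code and the helper name are ours, for illustration only; scipy is assumed to be available). It adopts the normal model of Fig. 1, approximating the null distribution of r as normal with standard error \(1/\sqrt{n}\), and uses the sample size \(n = 40\) from the numerical example given below.

```python
from math import sqrt
from scipy.stats import norm

def two_sided_p(r_obs: float, n: int) -> float:
    """Two-sided p value for an observed correlation under H0: R = 0,
    using the normal approximation r ~ N(0, 1/n) pictured in Fig. 1."""
    z = abs(r_obs) * sqrt(n)   # standardized test statistic
    return 2 * norm.sf(z)      # Pr[|r| >= r_obs | R = 0], both tails

print(round(two_sided_p(0.5, 40), 3))  # 0.002
print(round(two_sided_p(0.1, 40), 3))  # 0.527
```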
A small p value is generally interpreted as evidence against \(H_0\). Large p values, on the other hand, are not interpreted as evidence for \(H_0\). A large p value could correspond to (possibly weak) evidence against \(H_0\), or to evidence supporting \(H_0\). The p value does not provide a mechanism for distinguishing between these two possibilities. This point is related to the fact that calculation of the p value is based on the specification of only one hypothesis, \(H_0\). Recall, however, that evidence with respect to any single hypothesis can differ depending on the alternative. We interpret a small p value as evidence against \(H_0\) compared to any alternative, in the sense that the data are incompatible (to some degree) with \(H_0\) regardless of which alternative is entertained; but we allow for the fact that a large p value might correspond to evidence in favor of \(H_0\) against some alternatives but not others.
The p value is commonly used not only to assess the per-study evidence, but also to assess the evidence amalgamated across multiple studies. One approach to doing this is formal meta-analysis, which comes in a variety of flavors, all of which provide a single, summary p value across the input studies; another approach is based on the binary classification of individual studies as either “positive” (statistically significant) or “negative” (not statistically significant), and a set of heuristic procedures for combining results classified in this way. In the remainder of this section we will consider two meta-analytic approaches.Footnote 7 We’ll return to the classification-based approach in Sect. 4.Footnote 8
Meta-analysis I: Combining p values
Suppose we have two studies, \(S_1\) and \(S_2\), which are replicates in the sense defined above. Say we obtain p values \(P_1, P_2\) for studies \(S_1, S_2\), respectively. How do we obtain the combined evidence across the two studies?
We might reason as follows: Let’s assume that the p value, which is the probability of an event (e.g., the event \(e_1: [\,|r| \ge 0.50\,]\), or the event \(e_2: [\,|r| \ge 0.10\,]\)), is also a measure of the evidence. Because the two studies are conducted independently of one another, the probability of both events (say, \(e_1\) in \(S_1\) and \(e_2\) in \(S_2\)) is the product of their individual probabilities. Therefore, \(P_1 \times P_2\) should represent the combined evidence. The logic seems unassailable, but there is an obvious problem with this approach. By virtue of being a probability, we have \(0 \le P_i \le 1\) for all studies i. Thus \(P_1 \times P_2 \le \min(P_1, P_2)\); that is, the product of the p values can never exceed the smaller of the initial p values. If we were to interpret the result of multiplying p values as itself being a measure of evidence, we would have to conclude that the evidence always increases (or stays the same) upon consideration of a second study, regardless of the second study’s data. This is clearly wrong. Thus \(P_1 \times P_2\) cannot be interpreted as a measure of the combined evidence.
In fact, the product of p values is not itself a p value. This seemingly technical detail is important here because it illustrates the need for care in considering the input-output relationship inherent in any statistical amalgamation procedure. The product of the probabilities of two independent events is indeed the probability of the intersection of the events. But as it happens, the distribution of the product of two p values has a somewhat more complicated relationship to the distributions of the individual p values. Recall that the p value is defined as a particular (tail) probability, which is calculated with respect to the distribution of the variable(s) of interest. In our example, in order to find the p value in each of the individual studies, we needed to specify the distribution, assuming \(H_0\) to be true, of r. Similarly, if we wish to compute the amalgamated p value, then we need to find the functional form of the distribution, again under \(H_0\), of the statistic computed by the amalgamation procedure. This explains the apparent paradox of the preceding paragraph: \(P_1 \times P_2\) is the correct joint probability, but it does not have the same interpretation as the input p values, because they are tail probabilities under a particular distribution, while \(P_1 \times P_2\) is not.
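A small simulation illustrates the point (the code is our illustration, not part of the original argument; numpy is assumed). Under \(H_0\) a valid p value is uniformly distributed on [0, 1], so the probability that it falls at or below 0.05 is exactly 0.05; the product of two independent uniform p values fails this test badly, since \(\Pr[P_1 P_2 \le c] = c(1 - \ln c)\).

```python
import numpy as np

rng = np.random.default_rng(0)
n_sim = 100_000
p1 = rng.uniform(size=n_sim)  # under H0, each p value is Uniform(0, 1)
p2 = rng.uniform(size=n_sim)

# If the product were itself a p value, this frequency would be ~0.05 under H0;
# analytically it is 0.05 * (1 - ln 0.05), i.e. roughly 0.20.
print(np.mean(p1 * p2 <= 0.05))
```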
In the present case, the amalgamation statistic at hand is the product of the per-study p values, or equivalently, the sum of their natural logarithms. In order to compute the p value corresponding to this sum, one needs to determine the functional form of its distribution. This functional form was derived by Fisher (1925), who showed that \(-2\sum_{i=1}^{k} \ln P_i \sim \chi^2_{2k}\), where k is the number of studies being considered and \(P_i\) is the p value for the ith study. If one wishes to interpret the amalgamation output as a p value, the proper statistical procedure is to sum the logarithms of the per-study p values, multiply this sum by \(-2\), and then look up the corresponding tail probability in a table of \(\chi^2\) quantiles with the appropriate degrees of freedom.
To be concrete, let’s consider a numerical example. Let \(S_1\) have \(n_1 = 40\) and \(r_1 = 0.5\). This yields a p value of \(P_1 = 0.002\).Footnote 9 Let \(S_2\) have \(n_2 = 40\) and \(r_2 = 0.1\), yielding \(P_2 = 0.527\). Then to amalgamate these input p values to obtain the “total” p value \(P_{12}\), we look up the tail probability associated with \(-2[\ln(0.002) + \ln(0.527)] = 13.71\) on a \(\chi^2_4\) table. This amalgamated p value turns out to be 0.008. Note that, unlike the product \(P_1 \times P_2\) itself, \(0.008 > \min(P_1, P_2)\). Fisher’s method returns, in this particular example, an amalgamated p value that is intermediate between the two input per-study p values, and therefore less significant than the smaller of the two considered on its own.
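The calculation is easy to reproduce; the following sketch (again ours, assuming scipy) implements Fisher’s method and recovers the numbers above.

```python
from math import log
from scipy.stats import chi2

def fisher_combine(p_values):
    """Fisher's method: under H0, -2 * sum(ln P_i) ~ chi-squared with 2k d.f."""
    stat = -2 * sum(log(p) for p in p_values)
    return stat, chi2.sf(stat, df=2 * len(p_values))  # tail probability

stat, p12 = fisher_combine([0.002, 0.527])
print(round(stat, 2))  # 13.71
print(round(p12, 3))   # 0.008
```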
One way to think about Fisher’s method is in terms of a type of averaging procedure. An average is obtained by summing a set of quantities and dividing by the total number of quantities in the summation. Fisher’s method sums the (logarithms of the) p values, but rather than directly dividing this sum by k, the number of terms in the summation is accounted for by the degrees of freedom (d.f., in this case \(2k = 4\)) of the \(\chi^2_{\mathrm{d.f.}}\) distribution. This operation is related to averaging, insofar as it will often return a result intermediate between the extremes of the per-study p values. But it is not simple averaging, and there is no guarantee that it will always return an intermediate result. E.g., when both \(P_1\) and \(P_2\) are large, \(P_{12}\) will tend to be larger than \(\max(P_1, P_2)\), and when \(P_1\) and \(P_2\) are both small, \(P_{12}\) will tend to be smaller than \(\min(P_1, P_2)\). But in those cases where meta-analysis is most needed—situations in which not all studies show extreme results in the same direction—Fisher’s method tends to return something akin to an average p value.Footnote 10 Let’s call the arithmetic operation underlying this approach to meta-analysis paveraging (related to, but not the same as, P-averaging).
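These tendencies at the extremes are easy to check numerically (the input values here are illustrative choices of ours, computed as in the previous sketch):

```python
from math import log
from scipy.stats import chi2

def fisher_combine(p_values):
    stat = -2 * sum(log(p) for p in p_values)
    return chi2.sf(stat, df=2 * len(p_values))

print(fisher_combine([0.8, 0.9]))      # ~0.96, larger than max(P1, P2)
print(fisher_combine([0.01, 0.02]))    # ~0.002, smaller than min(P1, P2)
print(fisher_combine([0.002, 0.527]))  # ~0.008, the intermediate case from the text
```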
Paveraging p values across our two studies is a correct method for obtaining \(P_{12}\). The inputs are the per-study p values, and the output is a p value interpreted in exactly the same way as the inputs: \(P_{12}\) is the probability of obtaining a test statistic (viz., the sum of the log p values from the input studies) equal to or greater than the observed statistic, assuming \(H_0\) is true. But this establishes only that Fisher’s method is a correct amalgamation procedure for something, not necessarily for the evidence. It could turn out that Fisher’s method is the analogue of a correct procedure for ascertaining the total weight of our metal rods, without any explicit procedure for mapping this weight onto the intended object of our measurement, that is, the total length.
Insofar as we are interested solely in the p value per se, we need only be concerned with ascertaining the correct sampling distribution under \(H_0\) for any given statistic (including the product of p values); that is, the question of evidential interpretation is irrelevant. But if we are genuinely interested in amalgamation of evidence, then the question of interpretation is crucial. If the per-study p values are in fact measures of evidence, then Fisher’s method might plausibly be construed as giving us the amalgamated evidence. But if they are not, there is no basis for assigning an evidential interpretation to the amalgamation result. By the same token, if we cannot justify an evidential interpretation for the amalgamated p value obtained via Fisher’s method, then we must conclude that the per-study p values themselves should not be interpreted as measures of the evidence. We will return to the question of evidential interpretation below, after considering another and far more popular approach to meta-analysis.
Meta-analysis II: Combining parameter estimates
The second approach to meta-analysis employs an amalgamation operation that is related to but not the same as paveraging. It probably owes its popularity both to its flexibility in handling complications beyond the scope of this paper (in particular, extensions to random effects models) and to a pleasing intuitive connection with parameter estimation, as we will explain. But as we will see, parameter-based meta-analysis turns out to lack the impeccable logical rationale underlying Fisher’s method, leading to even thornier measurement issues.
We continue with the same example: two studies, each of which summarizes the data in terms of the observed correlation r between the same two variables, and each of which tests the null hypothesis \(R = 0\) based on a sample size of n. It is well established by classical statistical theory that (under some broad regularity conditions) the more data we have, the better will be our estimate of R. In this simple setting, the best estimate of R across our two studies is the weighted average \(r_{12}\) of the estimates obtained in each of the two studies, \(r_1\) and \(r_2\), where (in our simple example with equal sample sizes) \(r_{12} = (1/2)(r_1 + r_2)\). Standard meta-analysis across the studies proceeds in two steps: (1) calculate \(r_{12}\); then (2) find the p value on the combined studies, that is, based on the “combined” estimate \(r_{12}\) and the combined sample size, which enters through the standard error (s.e.) of the combined estimate.Footnote 11
Continuing with the numerical example from above \((n_1 = 40, r_1 = 0.5, P_1 = 0.002; n_2 = 40, r_2 = 0.1, P_2 = 0.527)\), we first calculate \(r_{12} = (1/2)(0.5 + 0.1) = 0.3\). Based on this new estimate and the combined sample size, the combined p value \(P_{12} = 0.007\). Two things about this result are noteworthy. First, this is not the same as the result we obtained from Fisher’s method \((P_{12} = 0.008)\).Footnote 12 This is not surprising, given that one technique considers only the per-study p values and the other explicitly takes parameter estimates and sample sizes into account. But it does raise the question of which p value, if either, represents the actual strength of evidence. The second noteworthy feature of this result is that, just as with Fisher’s method, we have \(P_1 < P_{12} < P_2\).Footnote 13
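In code, the two-step procedure might look as follows (our sketch; it again uses the normal approximation \(r \sim N(0, 1/n)\) under \(H_0\), which reproduces the quoted figures, although the standard-error calculation of Footnote 11 could be carried out differently in detail).

```python
from math import sqrt
from scipy.stats import norm

def raverage(r1, n1, r2, n2):
    """Step 1: sample-size-weighted average of the per-study estimates.
    Step 2: p value for that average at the combined sample size."""
    r12 = (n1 * r1 + n2 * r2) / (n1 + n2)  # reduces to (1/2)(r1 + r2) here
    z = abs(r12) * sqrt(n1 + n2)           # s.e. of r12 taken as 1/sqrt(n1 + n2)
    return r12, 2 * norm.sf(z)

r12, p12 = raverage(0.5, 40, 0.1, 40)
print(r12)            # 0.3
print(round(p12, 3))  # 0.007
```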
As with Fisher’s method, this form of meta-analysis involves an operation related to averaging in going from the per-study p values to the combined p value. In this case, however, it is not the p values themselves that are averaged (that is, paveraged), but rather, the per-study estimates of r. Let’s call this new operation raveraging. Raveraging takes the (weighted) average estimate of r and calculates the amalgamated p value based on this average and the new sample size under an appropriate null distribution.Footnote 14
In using raveraging as its amalgamation operation, meta-analysis in effect acknowledges that the study-wise p values are not themselves measures of evidence: if they were, then paveraging would seem to be the correct procedure. Raveraging has baked into it the premise that mapping study-wise results onto the amalgamated evidence must involve some function of \((r_1, n_1)\) and \((r_2, n_2)\). This suggests that, while the study-wise p values are not themselves measures of evidence, they should be transformable into evidence measures as a function of r and n. Just as with the need for a mapping function from weight to length in our length amalgamation analogy, it seems that one needs to invoke \((r_i, n_i)\) to map the p value of the ith study onto the evidence. This is the only explanation that allows us to simultaneously interpret the meta-analytic p value in terms of evidence while justifying an amalgamation procedure based on something other than the per-study p values themselves. But it precludes consideration of the per-study p value per se as an evidence measure in the absence of an explicit formal procedure for mapping it onto the evidence as a function of r and n.
Thus the inputs and outputs of parameter-based meta-analysis are, apparently, not on the same scale, that is, if we interpret the meta-analytic p value itself to be a measure of evidence. Moreover, turning to the question of interpretability vis-à-vis the underlying evidence, the situation is really quite logically perplexing. This form of meta-analysis appears to allow us to directly interpret its outcome measure—the meta-analytic p value—as an evidence measure, without further consideration of \(r_{12}\) and N, even while it entails a tacit acknowledgment that the per-study p values share no such quality. In fact, the statistical literature tends to support the idea that the p value per se, or taken on its own, is not a measure of evidence, but that it can be interpreted as a measure of evidence if one is careful to take various aspects of context into consideration (Wasserstein and Lazar 2016). Crucial aspects of context are often said to include effect-size (or, more generally, parameter) estimates and sample sizes, which bear on the power of a test based on the p value to reject \(H_0\) when it is false.Footnote 15 However, there is no explicit rule for these per-study p value transformations in the literature; rather, investigators are instructed to use judgment in taking extraneous factors into consideration when interpreting p values in evidential terms.
Does meta-analysis measure amalgamated evidence?
Let’s first consider raveraging (we will return to paveraging below). If the p value is not a direct measure of evidence, but requires transformation as a function of r and n (and perhaps other things) in order to represent the evidence, then some way of validating the transformation operation is needed. Parameter-based meta-analysis is ingenious in this regard, because raveraging itself incorporates a formal transformation rule at the amalgamation level, without committing to any particular transformation rule at the per-study level. Rather than providing a mechanism for directly transforming a p value onto the evidence scale, it builds a mapping function into the amalgamation procedure itself. Therefore, in order to assess whether r and n are in fact being properly taken into account, one needs to consider whether raveraging is returning the correct amalgamated evidence. Without a formal measure of per-study evidence on the table to begin with, this is a particularly challenging task. But there are some things we can say about it.
In our example, in which the second study yields a parameter estimate with the same sign as, but smaller than, the estimate from the first study, as noted above raveraging leaves us with \(P_{12}\) intermediate between \(P_1\) and \(P_2\). When \(P_1\) and \(P_2\) are from studies carried out one after the other, this result implies that the evidence grows weaker (yields a larger p value), relative to the original study, when we take the second study into account. Given the numbers used in this example, we think almost everyone will agree that this result is at least plausible. The reasoning seems to go like this: the true evidence should be the evidence corresponding to the best estimate of r. Since the estimate of r improves with sample size, the fact that \(r_{12} < r_1\) indicates that our initial estimate \(r_1\) was an overestimate. Then once this number is appropriately adjusted downward (indicating less correlation than we had originally supposed), it seems correct that the evidence against \(H_0\) should decrease (that is, correspond to a larger p value) when we take \(S_2\) into account. We tend to share the intuition that a decrease in r should produce a reduction in the strength of the evidence.Footnote 16
But at the same time, all other things being equal, we think we can all agree that evidence gets stronger with increasing sample size. Imagine that we had obtained \(r_2 = r_1 = 0.5\). Clearly in this case, having doubled the sample size while maintaining the same estimated effect size, we would expect the evidence against \(H_0\) to have grown stronger. Returning to the original example, two things are happening simultaneously: we have a change from \(r_1\) to \(r_{12}\), which, all other things being equal, might suggest a reduction in the evidence; and we have a change from n to 2n, which, all other things being equal, might entail an increase in the evidence. Intuition stops short of determining whether or not the dampening effect on the evidence of the decrease in r was sufficient to overcome the augmenting effect of the increase in sample size. Intuition may instruct us that the evidence might decrease going from \(r_1 = 0.5\) to \(r_2 = 0.1\), but there is no clear basis for an intuition that it did decrease, given the simultaneous doubling of the sample size. Appealing to intuition does not provide us with a means to decide whether or not raveraging, as an evidence amalgamation procedure, is behaving correctly in this case.
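The two competing effects can be seen side by side in the running example (our illustration, under the same normal approximation as in the earlier sketches):

```python
from math import sqrt
from scipy.stats import norm

def p_from_r(r, n):
    return 2 * norm.sf(abs(r) * sqrt(n))  # normal approximation, as before

print(p_from_r(0.5, 40))  # ~0.002 : the original study S1
print(p_from_r(0.5, 80))  # ~8e-06 : same estimate, doubled n -> evidence up
print(p_from_r(0.3, 80))  # ~0.007 : smaller estimate, doubled n -> which effect wins?
```

Raveraging settles the trade-off in a particular direction (0.007 versus the original 0.002), but, as just argued, intuition alone cannot certify that this is the correct behavior for an evidence measure.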
Indeed, we have a third intuition that further complicates things. Consider a legal argument which first points out an exact DNA match between the suspect and a sample taken from the crime scene, and then afterwards notes a blood type match. The first finding gives us relatively strong evidence that the suspect was present at the scene, while the second gives us weaker evidence since blood type matches are far more common. But we do not mentally adjust our original (DNA match based) assessment of the evidence strength downward after hearing about the blood type match. Here strong evidence followed by weak evidence in favor of the same conclusion increases (though perhaps only by a very small increment) the evidence relative to its initial state. By contrast, if the second piece of information had been an eye-witness report of seeing the suspect somewhere other than the crime scene at the time of the crime (which might only be weak evidence, depending on the reliability of the witness, but still evidence in favor of innocence), then the initially strong evidence would be tempered. Whether the evidence goes up or down when we receive the second bit of information seems to be a matter of whether the second bit favors guilt or innocence, rather than the strength of the evidence of the second bit of information relative to the strength of the initial evidence.Footnote 17
It is unclear whether this line of reasoning carries over to the statistical case, but if it does, we would have to say that whether \(P_1 < P_{12}\) is correct or not ought to depend upon whether \(P_2\) is (possibly weak) evidence against \(H_0\), or whether \(P_2\) is actually evidence for \(H_0\). In the former case, it would seem that the total evidence goes up; and only in the latter case does it go down. But remember, as noted at the outset, that based on the p value alone we cannot tell the difference. The larger p value in \(S_2\) could correspond to either (possibly weak) evidence against \(H_0\) or evidence in favor of \(H_0\). This again leaves us with no way to verify whether raveraging is doing the right thing when we interpret it as a measure of total evidence across the two studies.
Note too that all arguments in this section apply equally, if in a somewhat modified form, to meta-analysis based on direct combining of p values. Recall that, for our selected example, Fisher’s method also yielded \(P_1 < P_{12} < P_2\). If we decide in the end that this pattern does not accurately reflect the behavior of the evidence, then this poses as big a challenge to an evidential interpretation of \(P_{12}\) under paveraging as it does to the outcome of raveraging. Since the paveraged p value is demonstrably on the same scale as the input per-study p values, and since the amalgamation operation is logically and mathematically impeccable, we would have to conclude that the per-study p value is not a measure of evidence.
Raveraging, on the other hand, produces as the amalgamation output a p value that apparently has a scale that differs from that of its inputs, that is, if we are to interpret the raveraged p value itself as a direct measure of evidence. Raveraging produces a p value that is fundamentally different from the p value produced by paveraging, not merely because its numerical value may be different given the same set of input studies (arguably a problem in its own right), but because it bears a different relationship to the per-study p values corresponding to its inputs. Both approaches agree that, given the numbers in our example, the p value is larger after consideration of \(S_2\). But it is not clear that the evidence has gone down. Which method, if either, is correct? We are left up to this point in a bit of a muddle.
Summary of Sect. 3
We considered two forms of meta-analysis. One combines per-study p values to obtain an amalgamated p value; the other combines parameter estimates to arrive at an appropriately weighted average value, and then obtains an amalgamated p value by referring the new estimate to an appropriate distribution using the combined sample size. We called the former procedure paveraging and the latter raveraging. Raveraging, though the more popular of the two approaches in practice, is mysterious insofar as it provides a measure of evidence that adjusts the total p value as a function (sticking to the example considered in this section) of \(r_{12}\) and the total sample size N, even in the absence of a corresponding procedure for similarly transforming per-study p values into evidence measures.Footnote 18 And we are left with no way to confirm whether the raveraged p value is correctly reflecting the total evidence. The arguments that suggest that the raveraged p value may not be reflecting the total evidence apply to paveraging as well. As this latter method is unarguably a correct way to produce an overall p value, this further undermines interpretation of the per-study p value as an evidence measure in the first place.