## Abstract

The use of p values in null hypothesis statistical tests (NHST) has been controversial throughout the history of applied statistics, owing to a number of problems: arbitrary levels of Type I error, failure to trade off Type I and Type II error, misunderstanding of p values, failure to report effect sizes, and overlooking better means of reporting estimates of policy impacts, such as effect sizes, interpreted confidence intervals, and conditional frequentist tests. This paper analyzes the theory of p values and summarizes the problems with NHST. Using a large data set of public school districts in the United States, we demonstrate empirically the unreliability of p values and hypothesis tests as predicted by the theory. We offer specific suggestions for reporting policy research.

## Notes

We included only school districts categorized by the NCES as a “local school district” or a “local school district component of a supervisory union”. These districts accounted for over 90% of all public school enrollments and contained the majority of operating public schools. Of importance to this study, these districts received a substantial portion of the total local, state and federal revenues for public schools.

## References

1. Abelson, R.P.: On the surprising longevity of flogged horses: why there is a case for the significance test. Psychol. Sci. **8**, 12–20 (1997)
2. Barnett, V.: Comparative Statistical Inference, 2nd edn. Wiley, Hoboken (1982)
3. Bayarri, M.J., Berger, J.O.: The interplay of Bayesian and frequentist analysis. Stat. Sci. **19**, 58–80 (2004)
4. Bayarri, M.J., Berger, J.O.: P values for composite null models. J. Am. Stat. Assoc. **95**, 1127–1142 (2000)
5. Berger, J.O.: Could Fisher, Jeffreys and Neyman have agreed on testing? Stat. Sci. **18**, 1–12 (2003)
6. Berger, J.O., Sellke, T.: Testing a point null hypothesis: the irreconcilability of p values and evidence. J. Am. Stat. Assoc. **82**, 112–122 (1987)
7. Berkson, J.: Tests of significance considered as evidence. J. Am. Stat. Assoc. **37**, 325–335 (1942)
8. Casella, G., Berger, R.L.: Reconciling Bayesian and frequentist evidence in the one-sided testing problem. J. Am. Stat. Assoc. **82**, 106–111 (1987)
9. Cochran, W.G.: Sampling Techniques, 3rd edn. Wiley, New York (1977)
10. Cowles, M., Davis, C.: On the origins of the .05 level of statistical significance. Am. Psychol. **37**, 553–558 (1982)
11. Cox, D.R.: The role of significance tests. Scand. J. Stat. 49–63 (1977)
12. Cumming, G.: The new statistics: why and how. Psychol. Sci. **25**, 7–29 (2014)
13. Davidson, R., MacKinnon, J.G.: Estimation and Inference in Econometrics. Oxford University Press, Oxford (1993)
14. Durbin, J.: Estimation of parameters in time-series regression models. J. Roy. Stat. Soc. B **22**, 139–153 (1960)
15. Gigerenzer, G.: 2014: What scientific idea is ready for retirement? http://edge.org/response-detail/25462. Accessed 31 Aug 2014
16. Gravetter, F.J., Wallnau, L.B.: Statistics for the Behavioral Sciences, Chapter 8, “Introduction to Hypothesis Testing”. Cengage Learning, Wadsworth (2009)
17. Hamilton, J.D.: Time Series Analysis. Princeton University Press, Princeton (1994)
18. Harris, R.J.: Significance tests have their place. Psychol. Sci. **8**, 8–11 (1997)
19. Hodgson, R.T.: The problem of being a normal deviate. Am. J. Phys. **47**, 1092–1093 (1979)
20. Hunter, J.E.: Needed: a ban on the significance test. Psychol. Sci. **8**, 3–7 (1997)
21. Kelly, M.: Emily Dickinson and the monkeys on the stair Or: what is the significance of the 5% significance level. Significance **10**, 21–22 (2013)
22. Kendall, M.G., Stuart, A., Ord, J.K.: The Advanced Theory of Statistics, vol. III. Griffin, London (1987)
23. Kiefer, J.: Lecture Notes on Statistical Inference, mimeographed, n.d., n.p., obtained from the Cornell University Department of Mathematics (1979)
24. Killeen, P.R.: An alternative to null-hypothesis significance tests. Psychol. Sci. **16**, 345–353 (2005)
25. MacKinnon, D.P., Fairchild, A.J.: Current directions in mediation analysis. Curr. Dir. Psychol. Sci. **18**, 16–20 (2009)
26. Morrison, D.E., Henkel, R.E. (eds.): The Significance Test Controversy: A Reader. Butterworth, London (1970)
27. Poole, C.: Beyond the confidence interval. Am. J. Public Health **77**, 195–199 (1987)
28. Robins, J.M., van der Vaart, A., Ventura, V.R.: Asymptotic distribution of p values in composite null models. J. Am. Stat. Assoc. **95**, 1143–1156 (2000)
29. Royall, R.: Statistical evidence: a likelihood paradigm, vol. 71. CRC Press, Boca Raton (1997)
30. Royall, R.: On the probability of observing misleading statistical evidence. J. Am. Stat. Assoc. **95**, 760–768 (2000)
31. Scarr, S.: Rules of evidence: a larger context for the statistical debate. Psychol. Sci. **8**, 16–17 (1997)
32. Schervish, M.J.: P values: what they are and what they are not. Am. Stat. **50**, 203–206 (1996)
33. Sellke, T., Bayarri, M.J., Berger, J.O.: Calibration of \(p\) values for testing precise null hypotheses. Am. Stat. **55**, 62–71 (2001)
34. Spjøtvoll, E.: Discussion of D.R. Cox’s paper. Scand. J. Stat. 63–66 (1977)
35. U.S. Department of Commerce: Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, Applied Mathematics Series Volume 55 (1966)
36. Wellek, S.: A critical evaluation of the current “p-value controversy”. Biometr. J. (2017)
37. van Zwet, W.R.: Discussion of D.R. Cox’s paper. Scand. J. Stat. 67 (1977)
38. Yates, F.: The influence of statistical methods for research workers on the development of the science of statistics. J. Am. Stat. Assoc. **46**, 19–34 (1951)
39. Ziliak, S.T., McCloskey, D.N.: The cult of statistical significance. JSM conference paper, Section on Economic Education (2009)
40. Ziliak, S.T., McCloskey, D.N.: The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. The University of Michigan Press, Ann Arbor (2008)
41. Hempel, C.: Contributions to the Logical Analysis of the Concept of Probability, Ph.D. thesis, Friedrich Wilhelm University of Berlin, in German, translated by the author (1934)
42. von Mises, R.: Probability, Statistics, and Truth. Dover Publications, Inc, New York (1981). (reprint of earlier editions, the earliest being 1928)

## Additional information

The authors thank Amy Sibulkin for asking questions which led to this paper, participants in a seminar at the University of Kentucky, and an anonymous referee for additions which are significant using either frequentist or Bayesian methods.

## Appendices

### Appendix A: Asymmetrical confidence intervals

Confidence intervals for Normally distributed estimators are symmetrical and constructed in the familiar way taught in statistics courses. By the delta method, continuous and differentiable functions of Normally distributed estimators are also asymptotically Normal, but the sample size might need to be very large. For example, a chi square distribution is the distribution of a sum of squared standard Normals, and it becomes Normal only as the number of degrees of freedom goes to infinity. Ratios of Normally distributed estimators are technically Cauchy, with no finite mean or variance, because of the non-zero density at zero in the denominator [19]. The precision must increase to the point that the denominator has no probability large enough to matter of being at zero. For example, male adult American height is distributed Normal with a mean of 68 inches and a variance of 4 squared inches, so the mean is 34 standard deviations from 0. For estimators, that degree of precision might be difficult to achieve.
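The importance of the denominator's distance from zero can be illustrated with a small simulation. The sketch below is only an illustration and is not taken from the paper: the function names, the seed, the number of draws, and the choice of a standard Normal numerator are all assumptions. It uses the interquartile range rather than the variance, because the variance of a Cauchy variable does not exist.

```python
import random
from statistics import quantiles


def sds_from_zero(mean, variance):
    """Distance of a mean from zero, measured in standard deviations."""
    return mean / variance ** 0.5


def simulated_ratio_iqr(denom_mean, denom_sd, n=200_000, seed=12345):
    """Interquartile range of simulated N(0, 1) / N(denom_mean, denom_sd**2) ratios.

    The seed and sample size are arbitrary illustrative choices.
    """
    rng = random.Random(seed)
    ratios = [
        rng.gauss(0.0, 1.0) / rng.gauss(denom_mean, denom_sd) for _ in range(n)
    ]
    q1, _median, q3 = quantiles(ratios, n=4)
    return q3 - q1


# Height example from the text: mean 68, variance 4, so sd 2.
print(sds_from_zero(68, 4))

# Denominator centered at zero: the ratio is standard Cauchy (quartiles at -1 and +1).
print(simulated_ratio_iqr(0.0, 1.0))

# Denominator far from zero: the ratio behaves like a tightly concentrated Normal.
print(simulated_ratio_iqr(68.0, 2.0))
```

With the denominator centered at zero, the interquartile range should be close to 2, the value for a standard Cauchy. With the denominator 34 standard deviations from zero, the ratio is concentrated in a narrow band around zero.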

Products of estimators can arise in various ways, e.g. mediation models in psychology [25] or autoregressive disturbances in macroeconomics, i.e. AR(1) as estimated by Durbin [14], for which see Hamilton [17, p. 226]. Exponentiation of logit or hazard model coefficients can also result in non-Normal distributions unless the sample size is very large. The correct distribution is lognormal, which becomes Normal as the underlying variance becomes small, i.e. as the sample size becomes large.

If the confidence interval is asymmetric, it should be chosen to minimize the width while equalizing the probability density function at each end, while fixing the total probability at 95% or whatever arbitrary value is intended by the researcher. That can be done numerically, but special computer programs are required. Symmetric non-linear confidence intervals are inefficiently wide. Table 4 shows asymmetric confidence intervals for the standard problem of exponentiating a Normally distributed estimated parameter. The asymmetric intervals lie closer to zero. They are only slightly shorter if the standard error is small, but much shorter if the standard error is large.
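The construction described above can be sketched numerically for the exponentiated-coefficient case. The code below is a minimal illustration under stated assumptions, not the program used to produce Table 4: it assumes the estimate is exactly Normal with mean `b` and standard error `se`, and it finds the shortest interval by a golden-section search over the lower-tail probability. All function names and the example values of `b` and `se` are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal, mean 0 and sd 1


def lognormal_pdf(x, b, se):
    """Density of exp(B) at x when B ~ Normal(b, se**2)."""
    z = (math.log(x) - b) / se
    return math.exp(-0.5 * z * z) / (x * se * math.sqrt(2 * math.pi))


def equal_tailed_interval(b, se, level=0.95):
    """Exponentiate the endpoints of the usual symmetric interval for b."""
    z = STD_NORMAL.inv_cdf(0.5 + level / 2)
    return math.exp(b - z * se), math.exp(b + z * se)


def shortest_interval(b, se, level=0.95, tol=1e-10):
    """Shortest interval for exp(b) with the stated coverage.

    The search variable is the probability left in the lower tail;
    the remaining 1 - level - lower_tail goes in the upper tail.
    """
    def endpoints(lower_tail):
        lo = math.exp(b + se * STD_NORMAL.inv_cdf(lower_tail))
        hi = math.exp(b + se * STD_NORMAL.inv_cdf(lower_tail + level))
        return lo, hi

    def width(lower_tail):
        lo, hi = endpoints(lower_tail)
        return hi - lo

    # Golden-section search for the width-minimizing lower-tail probability.
    left, right = 1e-12, (1 - level) - 1e-12
    shrink = (math.sqrt(5) - 1) / 2
    while right - left > tol:
        c = right - shrink * (right - left)
        d = left + shrink * (right - left)
        if width(c) < width(d):
            right = d
        else:
            left = c
    return endpoints((left + right) / 2)


for se in (0.1, 1.0):  # hypothetical small and large standard errors
    print(se, equal_tailed_interval(0.0, se), shortest_interval(0.0, se))
```

At the minimizing interval the lognormal density is (numerically) equal at the two endpoints, which is the condition stated in the text. In this sketch the saving in width is small for `se = 0.1` and substantial for `se = 1.0`.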

When asymptotic distributions are asymmetric, confidence intervals do not correspond to p values in the usual calculation from Bell curves [34].

Using resampling variances, the confidence intervals can be computed from the empirical distribution of estimates. Using Bayesian variances, the confidence intervals can be computed from the posterior probability distribution of the parameter.
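As one concrete reading of the resampling approach, the sketch below computes a percentile interval from the empirical distribution of bootstrap estimates. It is an assumption-laden illustration rather than the authors' procedure: the function name, the number of draws, the seed, and the made-up data are all hypothetical, and the percentile method is only one of several bootstrap interval constructions.

```python
import math
import random
from statistics import fmean


def bootstrap_percentile_interval(data, statistic, level=0.95, draws=2000, seed=2018):
    """Percentile interval from the empirical distribution of resampled estimates."""
    rng = random.Random(seed)
    n = len(data)
    # Resample the data with replacement and recompute the statistic each time.
    estimates = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)]) for _ in range(draws)
    )
    tail = (1 - level) / 2
    lo = estimates[math.floor(tail * (draws - 1))]
    hi = estimates[math.ceil((1 - tail) * (draws - 1))]
    return lo, hi


# Made-up data standing in for estimated log effects (illustration only).
data_rng = random.Random(7)
log_effects = [data_rng.gauss(0.5, 1.0) for _ in range(40)]


def exp_of_mean(sample):
    """Exponentiated mean, a statistic with an asymmetric sampling distribution."""
    return math.exp(fmean(sample))


print(exp_of_mean(log_effects), bootstrap_percentile_interval(log_effects, exp_of_mean))
```

Because the interval is read directly from the sorted estimates, it is not forced to be symmetric around the point estimate.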

### Appendix B: Existent and hypothetical populations, or finite and infinite collectives

Policy analysis can proceed by collecting data on all charter schools or all environmental laws in a state or nation or all wars over a period of time. This can be construed as a population, but under that interpretation, there appears to be no uncertainty at all about the statistics or relationships in the data. Measurement error and omitted variables would continue to be explanations for standard errors, but neither of those is a satisfactory basis for evaluating policy analysis, as the first invalidates the data, and the second invalidates the model. There is a philosophical problem of justifying the application of inferential statistical formulae to an apparently complete census of a population.

If the sample really is a substantial subset of the population, the variances of sample means, and by implication moments and maximum likelihood estimators, are subject to the finite population correction (Cochran [9, Sect. 2.6, pp. 24–25]). If the sample size is n as usual while the population size is N, the variance of the sample mean is reduced by the factor \((N - n)/N\) when the population variance is defined with divisor \(N - 1\), or equivalently by the factor \((N - n)/(N - 1)\) when it is defined with divisor \(N\). In the latter form, the standard deviation is reduced by \(\sqrt{(N - n)/(N - 1)}\) and the covariance by a factor of \((N - n)/(N - 1)\). This can be safely ignored if the sample is a small part of the population, i.e. n/N is small, but under the interpretation that the present population of people, places, or things is all that matters, the finite population correction factor should always be applied, which would eliminate the variance of fixed effects for states in most policy studies. Ignoring the finite population correction factor would then be a standard mistake.
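The finite population correction can be checked directly on a small population by enumerating every possible sample drawn without replacement. The sketch below is an illustration under assumptions: the function names and the six-element population are made up, and the population variance is taken with divisor N − 1, so that the correction factor is (N − n)/N.

```python
from itertools import combinations
from statistics import fmean, pvariance, variance


def exact_variance_of_sample_mean(population, n):
    """Variance of the sample mean over all equally likely samples of size n."""
    means = [fmean(sample) for sample in combinations(population, n)]
    return pvariance(means)


def fpc_variance_of_sample_mean(population, n):
    """Textbook formula: (S^2 / n) * (N - n) / N, with S^2 using divisor N - 1."""
    N = len(population)
    s2 = variance(population)  # divisor N - 1
    return (s2 / n) * (N - n) / N


population = [3, 7, 8, 12, 15, 21]  # hypothetical complete population
for n in (2, 3, 6):
    print(
        n,
        exact_variance_of_sample_mean(population, n),
        fpc_variance_of_sample_mean(population, n),
    )
```

The two columns agree for every sample size, and both are zero when n = N: a complete census leaves no sampling variance, which is the difficulty the text raises.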

The philosophical interpretation that eliminates the problem in general is based on a distinction between existent and hypothetical populations (Kendall et al. [22, Sect. 1.29, pp. 22–23]). The work by von Mises [42, pp. 98–99], in which the word “collective” refers to the more conventional “population”, makes a distinction between the idea that “the calculus of probability deals each time merely with one single collective, whose distribution is subjected to certain summations or integrations”, i.e. the “restriction to one single initial collective”, and the “admissible distribution functions, the nature of the sub-sets of the attribute space for which probabilities can be defined, etc”. That is, there is a large set of possible combinations of attributes of the states, people, places, or things studied, and that is the collective which is studied. There could be a million different versions of New York State, only one of which is in fact observed. This makes n/N go to zero and justifies standard empirical inferential statistics.

Returning to Kendall et al. [22], the sample can be less than the entire population because some people were not included, but could in principle have been, or because not every possible toss of a die has been recorded, which could not even in principle be done. Those other tosses have a hypothetical existence. The hypothetical population also applies to people (or, in Kendall et al. [22, p. 23], to tsetse flies), as their condition and attributes could be different in a vast number of ways.

It must be emphasized that the distinction between existent and hypothetical populations is not merely a matter of ontological speculation—if it were we could safely ignore it—but one of practical importance when inferences are drawn about a population from a sample generated by it (Kendall et al. ([22], p. 23)).

Kendall et al. [22, Sect. 9.4, p. 292] pursue the idea of a hypothetical population further. If empirical probability is “a limiting relative frequency”, then the limit concept requires that samples go to infinity in principle, and “we must be able to contemplate a series of replications under identical conditions and to specify the possible values that might be realized” (Kendall et al. [22, p. 292]). Then “we must be willing that our observed value was randomly selected from the set of possibilities. At first sight, this is a rather baffling conception. However, such a structure is fundamental in a frequency-based theory of inference, and the approach is justified as being empirically useful” (p. 293).

The explanations of empirical probability by von Mises [42] repeatedly refer to the “limiting value of relative frequencies” (pp. 12, 21, 82, 105, 110, 124, 226, 229). The alternative to the argument of infinite sequences of observations from a hypothetical population is a “finite collective” (pp. 82–83). “There is no doubt about the fact that the sequences of observations to which the theory of probability is applied in practice are all finite” (p. 82). That is, samples are finite, and all states might be observed once. Infinite sequences are not substituted for the finite sequences. Probability and the resulting statistics are calculated based on the finite sequences. Hempel [41] similarly argued that “the results of a theory based on the notion of an infinite collective can be applied to finite sequences of observations” (von Mises [42, p. 85]).

In econometrics, the assumption is that there is a disturbance term in addition to the explanatory variables, so that given any set of explanatory variables, an infinite number of results (an infinite population or collective) is possible for any state, person, place, or thing. The mean, variance, and possibly the probability distribution could be estimated with enough data.
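The idea of many hypothetical outcomes for a single observed unit can be made concrete with a simulation. The sketch below is purely illustrative: the linear model, the Normal disturbance, the parameter values, and the function name are assumptions introduced here, not estimates from the paper.

```python
import random
from statistics import fmean, variance


def replicate_outcomes(x, intercept, slope, disturbance_sd, replications, seed=42):
    """Hypothetical outcomes for one unit with fixed x under y = a + b*x + e."""
    rng = random.Random(seed)
    return [
        intercept + slope * x + rng.gauss(0.0, disturbance_sd)
        for _ in range(replications)
    ]


# One unit (say, one state) with its explanatory variable held fixed at x = 10.
outcomes = replicate_outcomes(
    x=10.0, intercept=1.0, slope=0.5, disturbance_sd=2.0, replications=100_000
)

# With enough hypothetical replications, the mean and variance are recovered.
print(fmean(outcomes))     # close to 1.0 + 0.5 * 10.0 = 6.0
print(variance(outcomes))  # close to 2.0 ** 2 = 4.0
```

In practice only one of these replications is observed for each unit, which is why the population of possible outcomes is hypothetical rather than existent.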

## About this article

### Cite this article

Butler, J.S., Jones, P. Theoretical and empirical distributions of the p value.
*METRON* **76**, 1–30 (2018). https://doi.org/10.1007/s40300-017-0130-2

DOI: https://doi.org/10.1007/s40300-017-0130-2

### Keywords

- p values
- Null hypothesis statistical tests (NHST)
- Education finance