Theoretical and empirical distributions of the p value

Abstract

The use of p values in null hypothesis statistical tests (NHST) has long been controversial in applied statistics, owing to several problems: arbitrary levels of Type I error, failure to trade off Type I and Type II error, misunderstanding of p values, failure to report effect sizes, and neglect of better means of reporting estimates of policy impacts, such as effect sizes, interpreted confidence intervals, and conditional frequentist tests. This paper analyzes the theory of p values and summarizes the problems with NHST. Using a large data set of public school districts in the United States, we demonstrate empirically the unreliability of p values and hypothesis tests as predicted by the theory. We offer specific suggestions for reporting policy research.

Notes

  1. We included only school districts categorized by the NCES as a “local school district” or a “local school district component of a supervisory union”. These districts accounted for over 90% of all public school enrollments and contained the majority of operating public schools. Of importance to this study, these districts received a substantial portion of the total local, state and federal revenues for public schools.

References

  1. Abelson, R.P.: On the surprising longevity of flogged horses: why there is a case for the significance test. Psychol. Sci. 8, 12–20 (1997)

  2. Barnett, V.: Comparative Statistical Inference, 2nd edn. Wiley, Hoboken (1982)

  3. Bayarri, M.J., Berger, J.O.: The interplay of Bayesian and frequentist analysis. Stat. Sci. 19, 58–80 (2004)

  4. Bayarri, M.J., Berger, J.O.: P values for composite null models. J. Am. Stat. Assoc. 95, 1127–1142 (2000)

  5. Berger, J.O.: Could Fisher, Jeffreys and Neyman have agreed on testing? Stat. Sci. 18, 1–12 (2003)

  6. Berger, J.O., Sellke, T.: Testing a point null hypothesis: the irreconcilability of p values and evidence. J. Am. Stat. Assoc. 82, 112–122 (1987)

  7. Berkson, J.: Tests of significance considered as evidence. J. Am. Stat. Assoc. 37, 325–335 (1942)

  8. Casella, G., Berger, R.L.: Reconciling Bayesian and frequentist evidence in the one-sided testing problem. J. Am. Stat. Assoc. 82, 106–111 (1987)

  9. Cochran, W.G.: Sampling Techniques, 3rd edn. Wiley, New York (1977)

  10. Cowles, M., Davis, C.: On the origins of the .05 level of statistical significance. Am. Psychol. 37, 553–558 (1982)

  11. Cox, D.R.: The role of significance tests. Scand. J. Stat. 49–63 (1977)

  12. Cumming, G.: The new statistics: why and how. Psychol. Sci. 25, 7–29 (2014)

  13. Davidson, R., Mackinnon, J.G.: Estimation and Inference in Econometrics. Oxford University Press, Oxford (1993)

  14. Durbin, J.: Estimation of parameters in time-series regression models. J. Roy. Stat. Soc. B 22, 139–153 (1960)

  15. Gigerenzer, G.: 2014: What scientific idea is ready for retirement? http://edge.org/response-detail/25462. Accessed 31 Aug 2014

  16. Gravetter, F.J., Wallnau, L.B.: Statistics for the Behavioral Sciences, Chapter 8, “Introduction to Hypothesis Testing”. Cengage Learning, Wadsworth (2009)

  17. Hamilton, J.D.: Time Series Analysis. Princeton University Press, Princeton (1994)

  18. Harris, R.J.: Significance tests have their place. Psychol. Sci. 8, 8–11 (1997)

  19. Hodgson, R.T.: The problem of being a normal deviate. Am. J. Phys. 47, 1092–1093 (1979)

  20. Hunter, J.E.: Needed: a ban on the significance test. Psychol. Sci. 8, 3–7 (1997)

  21. Kelly, M.: Emily Dickinson and the monkeys on the stair Or: what is the significance of the 5% significance level. Significance 10, 21–22 (2013)

  22. Kendall, M.G., Stuart, A., Ord, J.K.: The Advanced Theory of Statistics, vol. III. Griffin, London (1987)

  23. Kiefer, J.: Lecture Notes on Statistical Inference, mimeographed, n.d., n.p., obtained from the Cornell University Department of Mathematics (1979)

  24. Killeen, P.R.: An alternative to null-hypothesis significance tests. Psychol. Sci. 16, 345–353 (2005)

  25. MacKinnon, D.P., Fairchild, A.J.: Current directions in mediation analysis. Curr. Dir. Psychol. Sci. 18, 16–20 (2009)

  26. Morrison, D.E., Henkel, R.E. (eds.): The significance test controversy–a reader. Butterworth, London (1970)

  27. Poole, C.: Beyond the confidence interval. Am. J. Public Health 77, 195–199 (1987)

  28. Robins, J.M., van der Vaart, A., Ventura, V.R.: Asymptotic distribution of p values in composite null models. J. Am. Stat. Assoc. 95, 1143–1156 (2000)

  29. Royall, R.: Statistical evidence: a likelihood paradigm, vol. 71. CRC Press, Boca Raton (1997)

  30. Royall, R.: On the probability of observing misleading statistical evidence. J. Am. Stat. Assoc. 95, 760–768 (2000)

  31. Scarr, S.: Rules of evidence: a larger context for the statistical debate. Psychol. Sci. 8, 16–17 (1997)

  32. Schervish, M.J.: P values: what they are and what they are not. Am. Stat. 50, 203–206 (1996)

  33. Sellke, T., Bayarri, M.J., Berger, J.O.: Calibration of \(p\) values for testing precise null hypotheses. Am. Stat. 55, 62–71 (2001)

  34. Spjøtvoll, E.: Discussion of D.R. Cox’s paper. Scand. J. Stat. 63–66 (1977)

  35. U.S. Department of Commerce: Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, Applied Mathematics Series, vol. 55 (1966)

  36. Wellek, S.: A critical evaluation of the current “p-value controversy”. Biometr J (2017)

  37. van Zwet, W.R.: Discussion of D.R. Cox’s paper. Scand. J. Stat. 67 (1977)

  38. Yates, F.: The influence of statistical methods for research workers on the development of the science of statistics. J. Am. Stat. Assoc. 46, 19–34 (1951)

  39. Ziliak, S.T., McCloskey, D.N.: The cult of statistical significance. JSM conference paper, Section on Economic Education (2009)

  40. Ziliak, S.T., McCloskey, D.N.: The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. The University of Michigan Press, Ann Arbor (2008)

  41. Hempel, C.: Contributions to the Logical Analysis of the Concept of Probability, Ph.D. thesis, Friedrich Wilhelm University of Berlin, in German, translated by the author (1934)

  42. von Mises, R.: Probability, Statistics, and Truth. Dover Publications, Inc, New York (1981). (reprint of earlier editions, the earliest being 1928)

Author information

Corresponding author

Correspondence to Peter Jones.

Additional information

The authors thank Amy Sibulkin for asking questions which led to this paper, participants in a seminar at the University of Kentucky, and an anonymous referee for additions which are significant using either frequentist or Bayesian methods.

Appendices

Appendix A: Asymmetrical confidence intervals

Confidence intervals for Normally distributed estimators are symmetrical and constructed in the familiar way taught in statistics courses. By the delta method, continuous and differentiable functions of Normally distributed estimators are also asymptotically Normal, but the sample size might need to be very large. For example, a chi square distribution is that of a sum of squares of independent standard Normals, and it approaches the Normal only as the number of degrees of freedom goes to infinity. Ratios of Normally distributed estimators are technically Cauchy, with no finite mean or variance [19], because of the non-zero density at zero in the denominator. The precision must increase to the point that no estimate could, with any probability large enough to matter, be at zero. For example, if male adult American height is distributed Normal with a mean of 68 inches and a variance of 4 squared inches (a standard deviation of 2 inches), the mean is 34 standard deviations from 0. For estimators, that degree of precision might be difficult to achieve.
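For reference, the delta method invoked in this paragraph can be stated as follows. The notation is generic and not the paper's own: the symbols \(\hat\theta_n\), \(\theta\), \(g\), and \(\sigma^2\) are assumptions introduced here for illustration.

```latex
\[
\sqrt{n}\,\bigl(\hat\theta_n - \theta\bigr) \xrightarrow{d} N\bigl(0, \sigma^2\bigr)
\quad\Longrightarrow\quad
\sqrt{n}\,\bigl(g(\hat\theta_n) - g(\theta)\bigr) \xrightarrow{d} N\bigl(0,\; g'(\theta)^2\,\sigma^2\bigr),
\]
% provided g is differentiable at \theta with g'(\theta) \neq 0.
% Example: g(\theta) = e^{\theta} gives asymptotic variance e^{2\theta}\sigma^2.
```

The result is asymptotic only. In a finite sample the distribution of \(g(\hat\theta_n)\) can be far from Normal, which is the concern raised in this appendix.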

Products of estimators can arise in various ways, e.g. mediation models in psychology [25] or autoregressive disturbances in macroeconomics, i.e. AR(1) as estimated by Durbin [14], for which see Hamilton [17, p. 226]. Exponentiation of logit or hazard model coefficients can also result in non-Normal distributions unless the sample size is very large. The correct distribution is lognormal, which becomes approximately Normal as the underlying variance becomes small, i.e. as the sample size becomes large.

If the confidence interval is asymmetric, it should be chosen to minimize the width, which means equalizing the probability density function at the two ends while fixing the total probability at 95% or whatever arbitrary value is intended by the researcher. That can be done numerically, but special computer programs are required. Symmetric confidence intervals for non-linear functions are inefficiently wide. Table 4 shows asymmetric confidence intervals for the standard problem of exponentiating a normally distributed estimated parameter. The asymmetric intervals are closer to zero than the symmetric ones; they are only slightly shorter if the standard error is small, but much shorter if the standard error is large.
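As a rough illustration of the equal-density rule, the following Python sketch finds the shortest 95% interval for exp(b) when b is Normally distributed and compares it with the equal-tailed interval. It is not the procedure behind Table 4. The function names, the ternary search, and the choice to parameterize by the standard error of b rather than of the exponentiated estimator are all assumptions made here for illustration.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal, used only for quantiles


def lognormal_pdf(x, mu, sigma):
    """Density of exp(b) at x > 0 when b ~ Normal(mu, sigma^2)."""
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


def interval_for_lower_tail(lower_tail, mu, sigma, level):
    """Interval for exp(b) with coverage `level` and `lower_tail` probability below it."""
    lo = math.exp(mu + sigma * STD_NORMAL.inv_cdf(lower_tail))
    hi = math.exp(mu + sigma * STD_NORMAL.inv_cdf(lower_tail + level))
    return lo, hi


def equal_tailed_interval(mu, sigma, level=0.95):
    """Interval that leaves the same probability in each tail."""
    return interval_for_lower_tail((1.0 - level) / 2.0, mu, sigma, level)


def shortest_interval(mu, sigma, level=0.95, iterations=200):
    """Shortest interval with the given coverage (hypothetical helper).

    The width is unimodal in the lower-tail probability for a unimodal
    density, so a ternary search locates the minimum; at the minimum the
    density is equal at the two end points.
    """
    def width(lower_tail):
        lo, hi = interval_for_lower_tail(lower_tail, mu, sigma, level)
        return hi - lo

    left, right = 1e-9, (1.0 - level) - 1e-9
    for _ in range(iterations):
        m1 = left + (right - left) / 3.0
        m2 = right - (right - left) / 3.0
        if width(m1) < width(m2):
            right = m2
        else:
            left = m1
    return interval_for_lower_tail(0.5 * (left + right), mu, sigma, level)


# Compare the two intervals for a small and a large standard error of b.
for sigma in (0.1, 1.0):
    eq_lo, eq_hi = equal_tailed_interval(0.0, sigma)
    sh_lo, sh_hi = shortest_interval(0.0, sigma)
    ratio = (sh_hi - sh_lo) / (eq_hi - eq_lo)
    print(f"s.e.={sigma}: equal-tailed=({eq_lo:.4f}, {eq_hi:.4f}) "
          f"shortest=({sh_lo:.4f}, {sh_hi:.4f}) width ratio={ratio:.3f}")
```

Under these assumptions the shortest interval sits closer to zero than the equal-tailed one, and the relative saving in width grows with the standard error, which matches the qualitative pattern described for Table 4.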

When asymptotic distributions are asymmetric, confidence intervals do not correspond to the p values obtained in the usual calculation from the Normal (bell) curve [34].

Using resampling variances, the confidence intervals can be computed from the empirical distribution of estimates. Using Bayesian variances, the confidence intervals can be computed from the posterior probability distribution of the parameter.
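A minimal sketch of the resampling route, assuming a percentile bootstrap and an exponentiated mean as the estimator. The data are simulated and purely hypothetical, and the function names and the number of replications are illustrative choices rather than anything specified in the paper.

```python
import math
import random


def exp_mean(values):
    """Exponentiated sample mean, standing in for an exponentiated coefficient."""
    return math.exp(sum(values) / len(values))


def bootstrap_percentile_interval(data, estimator, level=0.95, reps=2000, seed=0):
    """Percentile interval from the empirical distribution of bootstrap estimates."""
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(
        estimator([data[rng.randrange(n)] for _ in range(n)])  # resample with replacement
        for _ in range(reps)
    )
    tail = (1.0 - level) / 2.0
    lower = estimates[math.floor(tail * (reps - 1))]
    upper = estimates[math.ceil((1.0 - tail) * (reps - 1))]
    return lower, upper


# Hypothetical data: 40 draws from a Normal(0.5, 1) distribution.
data_rng = random.Random(12345)
data = [data_rng.gauss(0.5, 1.0) for _ in range(40)]

point = exp_mean(data)
lower, upper = bootstrap_percentile_interval(data, exp_mean, seed=1)
print(f"estimate={point:.3f}, 95% percentile interval=({lower:.3f}, {upper:.3f})")
print(f"distance below={point - lower:.3f}, distance above={upper - point:.3f}")
```

Because the interval is read directly from the sorted bootstrap estimates, it is not forced to be symmetric about the point estimate.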

Table 4 Symmetric and asymmetric confidence intervals for an exponentiated estimator. The effect size is irrelevant as both are scaled by its exponentiated value; 0 assumed here. The table is based on the s.e. of the exponentiated estimator.

Appendix B: Existent and hypothetical populations, or finite and infinite collectives

Policy analysis can proceed by collecting data on all charter schools or all environmental laws in a state or nation or all wars over a period of time. This can be construed as a population, but under that interpretation, there appears to be no uncertainty at all about the statistics or relationships in the data. Measurement error and omitted variables would continue to be explanations for standard errors, but neither of those is a satisfactory basis for evaluating policy analysis, as the first invalidates the data, and the second invalidates the model. There is a philosophical problem of justifying the application of inferential statistical formulae to an apparently complete census of a population.

If the sample really is a substantial subset of the population, the variances of sample means, and by implication moments and maximum likelihood estimators, are subject to the finite population correction (Cochran [9, Sect. 2.6, pp. 24–25]). If the sample size is \(n\) as usual while the population size is \(N\), the variance of the sample mean is reduced by the factor \((N - n)/N\) when the population variance is defined with divisor \(N - 1\), or equivalently by \((N - n)/(N - 1)\) when it is defined with divisor \(N\). Under the latter convention, the standard deviation is reduced by \(\sqrt{(N - n)/(N - 1)}\) and the covariance by a factor of \((N - n)/(N - 1)\). This can be safely ignored if the sample is a small part of the population, i.e. \(n/N\) is small, but under the interpretation that the present population of people, places, or things is all that matters, the finite population correction factor should always be applied. Because \(n = N\) makes the factor zero, this would eliminate the variance of fixed effects for states in most policy studies. Ignoring the finite population correction factor would then be a standard mistake.
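The correction can be checked by brute force on a small population. The sketch below uses exact rational arithmetic and a made-up population of six values (an assumption for illustration only). It enumerates every sample of size n drawn without replacement and compares the variance of the sample means with the two equivalent forms of the formula.

```python
from fractions import Fraction
from itertools import combinations


def mean(values):
    values = list(values)
    return Fraction(sum(values), len(values))


def sum_of_squared_deviations(values):
    values = list(values)
    center = mean(values)
    return sum((v - center) ** 2 for v in values)


def variance_of_mean_by_enumeration(population, n):
    """Exact variance of the sample mean over all equally likely samples of size n."""
    sample_means = [mean(s) for s in combinations(population, n)]
    return sum_of_squared_deviations(sample_means) / len(sample_means)


def variance_of_mean_divisor_n_minus_1(population, n):
    """S^2 / n * (N - n) / N, with S^2 defined using divisor N - 1."""
    N = len(population)
    s_squared = sum_of_squared_deviations(population) / (N - 1)
    return s_squared / n * Fraction(N - n, N)


def variance_of_mean_divisor_n(population, n):
    """sigma^2 / n * (N - n) / (N - 1), with sigma^2 defined using divisor N."""
    N = len(population)
    sigma_squared = sum_of_squared_deviations(population) / N
    return sigma_squared / n * Fraction(N - n, N - 1)


population = [2, 4, 4, 7, 9, 10]  # hypothetical values, N = 6
for n in (1, 3, 6):
    print(n,
          variance_of_mean_by_enumeration(population, n),
          variance_of_mean_divisor_n_minus_1(population, n),
          variance_of_mean_divisor_n(population, n))
```

The three columns agree for every n, and the variance is exactly zero when n = N, that is, when the sample is a complete census.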

The philosophical interpretation that eliminates the problem in general is based on a distinction between existent and hypothetical populations (Kendall et al. ([22], Sect. 1.29, pp. 22–23)). The work by von Mises [42, pp. 98–99], in which the word “collective” refers to the more conventional “population”, makes a distinction between the idea that “the calculus of probability deals each time merely with one single collective, whose distribution is subjected to certain summations or integrations”, i.e. the “restriction to one single initial collective”, versus the “admissible distribution functions, the nature of the sub-sets of the attribute space for which probabilities can be defined, etc”. That is, there is a large set of possible combinations of attributes of the states, people, places, or things studied, and that set is the collective which is studied. There could be a million different versions of New York State, only one of which is in fact observed. Under this interpretation n/N goes to zero, which justifies standard empirical inferential statistics.

Returning to Kendall et al. ([22]), the sample can be less than the entire population because some people were not included, but could in principle have been, or because not every possible toss of a die has been recorded, which could not even in principle be done. Those other tosses have a hypothetical existence. The hypothetical population also applies to people—in Kendall et al. ([22], p. 23), to tsetse flies—as their condition and attributes could be different in a vast number of ways.

It must be emphasized that the distinction between existent and hypothetical populations is not merely a matter of ontological speculation—if it were we could safely ignore it—but one of practical importance when inferences are drawn about a population from a sample generated by it (Kendall et al. ([22], p. 23)).

Kendall et al. ([22], Sect. 9.4, p. 292) pursue the idea of a hypothetical population further. If empirical probability is “a limiting relative frequency”, then the limit concept requires that samples go to infinity in principle, and “we must be able to contemplate a series of replications under identical conditions and to specify the possible values that might be realized” (Kendall et al. ([22], p. 292)). Then “we must be willing that our observed value was randomly selected from the set of possibilities. At first sight, this is a rather baffling conception. However, such a structure is fundamental in a frequency-based theory of inference, and the approach is justified as being empirically useful” (p. 293).

The explanations of empirical probability by von Mises [42] repeatedly refer to the “limiting value of relative frequencies” (pp. 12, 21, 82, 105, 110, 124, 226, 229). The alternative to the argument of infinite sequences of observations from a hypothetical population is a “finite collective” (pp. 82–83). “There is no doubt about the fact that the sequences of observations to which the theory of probability is applied in practice are all finite” (p. 82). That is, samples are finite, and all states might be observed once. Infinite sequences are not substituted for the finite sequences. Probability and the resulting statistics are calculated based on the finite sequences. Hempel [41] similarly argued that “the results of a theory based on the notion of an infinite collective can be applied to finite sequences of observations” (von Mises [42, p. 85]).

In econometrics, the assumption is that there is a disturbance term in addition to the explanatory variables, so that given any set of explanatory variables, an infinite number of results (an infinite population or collective) is possible for any state, person, place, or thing. The mean, variance, and possibly the probability distribution could be estimated with enough data.

About this article

Cite this article

Butler, J.S., Jones, P. Theoretical and empirical distributions of the p value. METRON 76, 1–30 (2018). https://doi.org/10.1007/s40300-017-0130-2

  • DOI: https://doi.org/10.1007/s40300-017-0130-2

Keywords

  • p values
  • Null hypothesis statistical tests (NHST)
  • Education finance