Abstract
In the 1920s, Ronald Fisher developed the theory behind the p value, and Jerzy Neyman and Egon Pearson developed the theory of hypothesis testing. These distinct theories have provided researchers with important quantitative tools to confirm or refute their hypotheses. The p value is the probability of obtaining an effect equal to or more extreme than the one observed, presuming the null hypothesis of no effect is true; it gives researchers a measure of the strength of evidence against the null hypothesis. As commonly used, investigators select a threshold p value below which they reject the null hypothesis. The theory of hypothesis testing allows researchers to reject a null hypothesis in favor of an alternative hypothesis of some effect. As commonly used, investigators choose Type I error (rejecting the null hypothesis when it is true) and Type II error (accepting the null hypothesis when it is false) levels and determine some critical region. If the test statistic falls into that critical region, the null hypothesis is rejected in favor of the alternative hypothesis. Despite their similarities, the p value and the theory of hypothesis testing are different theories that are often misunderstood and confused, leading researchers to improper conclusions. Perhaps the most common misconception is to consider the p value as the probability that the null hypothesis is true rather than the probability of obtaining the difference observed, or one that is more extreme, assuming the null hypothesis is true. Another concern is the risk that a substantial proportion of statistically significant results are falsely significant. Researchers should have a minimum understanding of these two theories so that they are better able to plan, conduct, interpret, and report scientific experiments.
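The distinction between the two theories can be illustrated with a minimal Python sketch. It is not taken from the article: the exact one-sided binomial test, the sample size of 20, the 15 observed successes, the alternative success probability of 0.8, and the 10% prior share of true effects are all assumptions chosen for illustration, and the function names are hypothetical. The first part computes a p value as a measure of evidence against the null hypothesis; the second fixes a Type I error level in advance, derives a critical region, and reports the resulting Type II error; the last function gives a rough estimate of the share of significant results that are falsely significant.

```python
from math import comb


def binom_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def p_value(k, n, p0):
    """Fisher-style p value: probability, under the null success rate p0,
    of a result equal to or more extreme than the k successes observed."""
    return binom_tail(k, n, p0)


def critical_value(n, p0, alpha):
    """Neyman-Pearson-style critical region {X >= c}: the smallest c whose
    tail probability under the null does not exceed the chosen alpha."""
    for c in range(n + 2):
        if binom_tail(c, n, p0) <= alpha:
            return c


def power(n, p0, p1, alpha):
    """Probability of landing in the critical region when the alternative
    success rate p1 is true; the Type II error is 1 minus this value."""
    return binom_tail(critical_value(n, p0, alpha), n, p1)


def false_positive_share(alpha, test_power, prior_true):
    """Share of significant results that are false positives, assuming a
    fraction prior_true of the tested hypotheses correspond to real effects."""
    false_pos = alpha * (1 - prior_true)
    true_pos = test_power * prior_true
    return false_pos / (false_pos + true_pos)


# Illustrative numbers only (assumptions, not data from the article).
n, k, p0, p1, alpha = 20, 15, 0.5, 0.8, 0.05

# Fisher: report the p value as graded evidence against the null.
print("p value:", p_value(k, n, p0))

# Neyman-Pearson: fix alpha first, then make a binary decision.
c = critical_value(n, p0, alpha)
print("critical region: X >=", c)
print("reject null:", k >= c)
print("Type II error:", 1 - power(n, p0, p1, alpha))

# Falsely significant results, using nominal alpha and power values.
print("false positive share:", false_positive_share(0.05, 0.8, 0.1))
```

Note that the p value depends on the data actually observed, whereas the critical region and the two error levels are fixed before any data are seen; neither quantity is the probability that the null hypothesis is true.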
Additional information
Each author certifies that he or she has no commercial associations (eg, consultancies, stock ownership, equity interest, patent/licensing arrangements, etc) that might pose a conflict of interest in connection with the submitted article.
About this article
Cite this article
Biau, D.J., Jolles, B.M. & Porcher, R. P Value and the Theory of Hypothesis Testing: An Explanation for New Researchers. Clin Orthop Relat Res 468, 885–892 (2010). https://doi.org/10.1007/s11999-009-1164-4
DOI: https://doi.org/10.1007/s11999-009-1164-4