P Value and the Theory of Hypothesis Testing: An Explanation for New Researchers
 David Jean Biau MD,
 Brigitte M. Jolles MD, MSc,
 Raphaël Porcher PhD
Abstract
In the 1920s, Ronald Fisher developed the theory behind the p value, and Jerzy Neyman and Egon Pearson developed the theory of hypothesis testing. These distinct theories have provided researchers with important quantitative tools to confirm or refute their hypotheses. The p value is the probability of obtaining an effect equal to or more extreme than the one observed, presuming the null hypothesis of no effect is true; it gives researchers a measure of the strength of evidence against the null hypothesis. As commonly used, investigators select a threshold p value below which they reject the null hypothesis. The theory of hypothesis testing allows researchers to reject a null hypothesis in favor of an alternative hypothesis of some effect. As commonly used, investigators choose Type I error (rejecting the null hypothesis when it is true) and Type II error (accepting the null hypothesis when it is false) levels and determine a critical region. If the test statistic falls into that critical region, the null hypothesis is rejected in favor of the alternative hypothesis. Despite their similarities, the p value and the theory of hypothesis testing are different theories that are often misunderstood and confused, leading researchers to improper conclusions. Perhaps the most common misconception is to consider the p value as the probability that the null hypothesis is true rather than the probability of obtaining the observed difference, or a more extreme one, assuming the null hypothesis is true. Another concern is the risk that a substantial proportion of statistically significant results are falsely significant. Researchers should have a minimum understanding of these two theories so that they are better able to plan, conduct, interpret, and report scientific experiments.
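The distinction drawn above can be illustrated with a minimal sketch. It is not taken from the article: it assumes a test statistic z that follows a standard normal distribution under the null hypothesis, and the function names and the example values (z = 2.1, a 10% prior share of true effects, 80% power) are hypothetical choices made only for illustration.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # assumed null distribution of the statistic


def two_sided_p_value(z: float) -> float:
    """Fisher's p value: probability, under the null, of a statistic
    at least as extreme as the observed z (in either direction)."""
    return 2.0 * (1.0 - STANDARD_NORMAL.cdf(abs(z)))


def critical_value(alpha: float) -> float:
    """Neyman-Pearson: boundary of the two-sided critical region
    for a chosen Type I error level alpha."""
    return STANDARD_NORMAL.inv_cdf(1.0 - alpha / 2.0)


def reject_null(z: float, alpha: float = 0.05) -> bool:
    """Decision rule: reject when the statistic falls in the critical region."""
    return abs(z) > critical_value(alpha)


def false_positive_share(prior_true_effect: float, alpha: float, power: float) -> float:
    """Share of statistically significant results that are false positives,
    given the (assumed) proportion of tested hypotheses with a real effect."""
    false_positives = alpha * (1.0 - prior_true_effect)
    true_positives = power * prior_true_effect
    return false_positives / (false_positives + true_positives)


if __name__ == "__main__":
    z_observed = 2.1  # hypothetical observed statistic
    print("p value:", round(two_sided_p_value(z_observed), 4))
    print("reject at alpha = 0.05:", reject_null(z_observed))
    # Hypothetical setting: 10% of tested effects are real, 80% power.
    print("false positive share:", round(false_positive_share(0.10, 0.05, 0.80), 2))
```

The p value is a continuous measure computed from the data, whereas the critical region is fixed in advance from the chosen error levels and yields only a reject or do-not-reject decision. The last function shows why a p value below the threshold is not the probability that the null hypothesis is true: that share depends on power and on how many tested hypotheses carry a real effect, neither of which enters the p value.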
 Journal

Clinical Orthopaedics and Related Research®
Volume 468, Issue 3, pp 885-892
 Cover Date
 2010-03-01
 DOI
 10.1007/s11999-009-1164-4
 Print ISSN
 0009-921X
 Online ISSN
 1528-1132
 Publisher
 Springer-Verlag
 Authors

 David Jean Biau MD ^{(1)}
 Brigitte M. Jolles MD, MSc ^{(2)}
 Raphaël Porcher PhD ^{(1)}
 Author Affiliations

 1. Département de Biostatistique et Informatique Médicale, INSERM–UMRS 717, AP-HP, Université Paris 7, Hôpital Saint-Louis, 1, Avenue Claude-Vellefaux, 75475 Paris Cedex 10, France
 2. Hôpital Orthopédique, Département de l’Appareil Locomoteur, Centre Hospitalier Universitaire Vaudois, Université de Lausanne, Lausanne, Switzerland