Despite its ubiquity in scientific and medical literature, the p value is a commonly misunderstood and misinterpreted statistical concept. Tam et al. conducted a survey-based study examining quantitative and qualitative understanding of p values among a group of 247 general practitioners and demonstrated that the two most common misconceptions were (1) the use of p values to represent real-world probabilities, equating significance with a 95% chance of the test hypothesis being true vs a 5% chance of it being false, and (2) the use of P = 0.05 as a threshold for evidence of observable results (i.e., P ≤ 0.05 = observable effect; P > 0.05 = no observable effect) [1]. Together, these two conceptualizations accounted for 83% of the survey responses. Another study of statistical literacy among 277 residents from 11 different programs demonstrated that only 58% of the participants could correctly interpret a p value, despite 88% of respondents indicating confidence in their understanding of p values, demonstrating a clear confidence-ability gap [2]. We can reasonably assume this gap also exists among practicing orthopaedic surgeons.

Beyond confusing the individual reader, poor understanding of p values has grave and far-reaching effects on the larger research community. Poor statistical literacy among researchers and peer reviewers has led to widespread publication of biased studies, many with potentially irreproducible results. This phenomenon jeopardizes the development of a reliable scientific knowledge base. Practices such as p-hacking, or cherry-picking data so that non-significant findings appear significant, have further weakened the integrity of clinical research [3]. By flooding the literature with potentially misleading information, the misuse and manipulation of p values threatens the future of evidence-based medicine, and we must course-correct.

To begin, we must properly define the p value. The p value is not a measure of the probability of a hypothesis being true or untrue. The standard significance level (alpha), arbitrarily set at 0.05, is the threshold at which the null hypothesis, the assumption that two groups are the same, is either rejected (P ≤ 0.05) or not rejected (P > 0.05) [4]. Thus, the p value is a measure of the degree to which experimental data conform to the distribution predicted by the null hypothesis. Stated another way, the p value represents how likely one is to obtain a result at least as extreme as the one observed, assuming the null hypothesis is true, i.e., the probability that the observed result arose by chance alone (Fig. 1). In this sense, the lower the p value, the less likely it is that the observed result or difference is due to chance.
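As a concrete illustration, the following minimal sketch (written in Python; the group sizes, means, and standard deviations are hypothetical values chosen only for the example) computes a p value with a two-sample t test:

```python
# Minimal sketch: two-sample t test on hypothetical post-operative outcome scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# Hypothetical outcome scores for two treatment groups (assumed parameters).
group_a = rng.normal(loc=70, scale=10, size=30)
group_b = rng.normal(loc=75, scale=10, size=30)

# The p value is the probability of observing a difference at least this
# extreme if the null hypothesis of no group difference were true.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

The printed p value answers only one question, namely how surprising the observed difference would be if the null hypothesis were true; it says nothing about the probability that either hypothesis is correct.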

Fig. 1

Adapted from Wikimedia Commons at https://commons.wikimedia.org

Shaded area (green) represents the region of the distribution beyond the threshold defined by alpha, where alpha is traditionally set at 0.05; under the null hypothesis, the total probability of this region equals alpha. Under the assumption that the null hypothesis is true, results falling in this shaded region are unlikely to have occurred by chance alone, suggesting that the null hypothesis is not the best-fitting hypothesis.

One of the most common fallacies related to p values is that the null hypothesis can be completely accepted or rejected without consideration of other factors. Obtaining a value of P ≤ 0.05 means only that, under the assumption that the null hypothesis is true and that only chance affects the observed relationship, a difference between the two groups as great as or greater than the one observed would occur no more than 5% of the time. In reality, many factors may introduce errors, which can lead to a misleadingly small p value [5]. For example, a researcher may perform multiple comparisons and, despite having established a p value threshold of 0.05, fail to adjust it for those comparisons (discussed below). Likewise, researchers may want more confidence in their findings, and it may at times be preferable to lower the threshold to 0.01 or even 0.001, as is common practice in fields such as astrophysics and genomics. With a better working understanding of the p value as a measure of how well data fit a prediction, researchers can also challenge Type I errors (false positives), which result from an erroneous rejection of the null hypothesis. In other words, reporting an observed difference as due to the experimental conditions is more meaningful when the potential sources of error (e.g., selection bias, repeated measurements) are acknowledged.
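To see why unadjusted multiple comparisons are a problem, the brief simulation below (a sketch with assumed parameters: 10 comparisons per experiment, 30 subjects per group, and no true effect anywhere) estimates how often at least one comparison crosses P ≤ 0.05 by chance alone:

```python
# Minimal simulation (assumed parameters): with no true effect, running many
# unadjusted comparisons makes at least one "significant" result likely.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
n_experiments = 2000
n_tests = 10          # comparisons per experiment, all with no real effect
alpha = 0.05

false_positive_runs = 0
for _ in range(n_experiments):
    p_values = []
    for _ in range(n_tests):
        a = rng.normal(size=30)
        b = rng.normal(size=30)   # drawn from the same distribution: null is true
        p_values.append(stats.ttest_ind(a, b).pvalue)
    if min(p_values) <= alpha:
        false_positive_runs += 1

print(f"At least one p <= 0.05 in {false_positive_runs / n_experiments:.0%} of runs")
```

With 10 independent comparisons, the chance of at least one false positive is roughly 1 − 0.95^10, or about 40%, which is why an unadjusted threshold of 0.05 no longer protects against Type I error.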

Another common misconception about statistical significance is associating the level of significance (i.e., the size of the p value) with the magnitude of the effect (the effect size). For example, some readers may interpret a p value of 0.0001 not only as indicating statistically significant post-operative outcome improvement, but may also assume that, because the p value is so small, the magnitude of that improvement must be very large; this is not always the case.

A p value depends on both the effect size, or the strength of the relationship between the variables, and the sample size. A small p value could be due to a large effect, or it could result from a small effect measured in a sufficiently large sample, and vice versa [6]. When examining the research literature, therefore, we must keep effect size, variance (the spread of the data), and sample size in mind, as all three play a part in determining the final p value. As mentioned above, the commonly accepted threshold value for P is 0.05; however, this may not be an appropriate threshold for all studies. Imagine two strands of DNA whose nucleotide sequences are statistically different, yet both code for the same functional protein, or for a sufficient level of functional protein that the organism is unaffected. In this case, there would be no clinical relevance associated with the statistical significance (Fig. 2) [7]. In addition, p values are usually two-sided, testing whether the effect differs from the null value (0) in either direction. However, for questions concerned only with whether outcomes improve, which is probably the most commonly posed question in clinical research, a one-sided p value may be more appropriate.
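The interplay between effect size and sample size can be made concrete with a short sketch (hypothetical means and standard deviations; the true difference is fixed at 2 points against a standard deviation of 10, i.e., a small effect of roughly Cohen's d = 0.2):

```python
# Minimal sketch (assumed parameters): the same modest effect size yields very
# different p values depending only on sample size.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)

def simulate(n):
    """Two groups differing by 2 points (SD 10), i.e., Cohen's d of about 0.2."""
    a = rng.normal(loc=70, scale=10, size=n)
    b = rng.normal(loc=72, scale=10, size=n)
    pooled_sd = np.sqrt((a.std(ddof=1) ** 2 + b.std(ddof=1) ** 2) / 2)
    d = (b.mean() - a.mean()) / pooled_sd
    p = stats.ttest_ind(a, b).pvalue
    return d, p

for n in (20, 200, 2000):
    d, p = simulate(n)
    print(f"n per group = {n:5d}: Cohen's d = {d:.2f}, p = {p:.4f}")
# The estimated effect size stays near 0.2 while the p value generally shrinks
# as n grows, so a tiny p value does not by itself imply a large effect.
```

The effect size hovers around 0.2 regardless of sample size, while the p value tends to shrink dramatically as n grows; statistical significance therefore cannot substitute for a judgment about clinical relevance.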

Fig. 2

Potential cases of statistical significance vs clinical relevance. Figure adapted from Ganesh et al. 2017

Comparing p values between studies is another common source of confusion. Foremost, p values obtained from tests in two different samples cannot be compared directly. If study A, examining two treatments, yields P = 0.1 and study B, examining similar treatments, yields P = 0.001, the results of the two studies are not necessarily contradictory. Conversely, if P = 0.03 in both, it does not mean that the study results agree. This is largely because of variation in sample sizes, as well as in the standard deviation of the measured variable, both of which directly and indirectly influence the p value. Very few cross-study conclusions can be drawn from the p value alone, and further statistical analysis, such as the determination of confidence intervals, is often required [6, 8]. It is in cases such as this that large registry studies and prospective randomized controlled trials benefit us most, allowing researchers to examine the superiority of treatment methodologies across patient populations.
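As a sketch of why confidence intervals travel across studies better than p values do (the summary statistics below are hypothetical, and equal group sizes and standard deviations are assumed for simplicity):

```python
# Minimal sketch (hypothetical summary data): two studies of the same effect can
# produce very different p values yet compatible confidence intervals, largely
# because of their different sample sizes.
import numpy as np
from scipy import stats

def diff_ci_and_p(mean_a, mean_b, sd, n, alpha=0.05):
    """Approximate CI and p value for a difference in means (equal n and SD assumed)."""
    diff = mean_b - mean_a
    se = sd * np.sqrt(2.0 / n)
    df = 2 * n - 2
    p = 2 * stats.t.sf(abs(diff / se), df)
    half_width = stats.t.ppf(1 - alpha / 2, df) * se
    return diff, (diff - half_width, diff + half_width), p

# Study A: small sample; Study B: large sample; same underlying 3-point benefit.
for label, n in (("Study A", 20), ("Study B", 200)):
    diff, ci, p = diff_ci_and_p(mean_a=70, mean_b=73, sd=10, n=n)
    print(f"{label}: difference = {diff:.1f}, "
          f"95% CI = ({ci[0]:.1f}, {ci[1]:.1f}), p = {p:.3f}")
```

Although one study is "significant" and the other is not, their confidence intervals overlap substantially, so the two results are compatible rather than contradictory.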

Many researchers have noted the increased frequency of p values near the threshold of 0.05, with values just below it (e.g., P = 0.048) occurring more often than values just above it (e.g., P = 0.051); this pattern is often an indicator of p-hacking. P-hacking, defined as the manipulation of data or analyses to manufacture significance, is a systemic problem in research. Authors who p-hack are often concerned with increasing their personal number of peer-reviewed publications; they sacrifice the quality of their scientific investigation in favor of studies that show a statistical difference, exploiting positive publication bias. Common ways to p-hack include assessing multiple outcomes and reporting only those that reach significance, or increasing the sample size until the p value falls within the range of significance without an explicit a priori power (sample size) calculation. Bin Abd Razak et al. investigated the articles published in 2015 by three top orthopaedic journals [9]. Through text-mining, they identified articles that provided a single p value when reporting results for their main hypothesis. Theoretically, the frequency of reported p values should decrease as the p value increases towards 0.05, because observed real-world phenomena should presumably reflect a strong relationship between variables (a large effect size) and therefore a small p value. Following this logic, a higher frequency of small p values is expected; instead, the study showed an upward trend in frequency approaching P = 0.05, suggesting the presence of p-hacking [9].
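One mechanism behind this excess of just-significant results can be demonstrated with a simple simulation of optional stopping (a sketch with assumed parameters: identical populations, recruitment in batches of 10 per arm, and a cap of 100 per arm):

```python
# Minimal simulation (assumed parameters): "peeking" at the data and adding
# subjects until p <= 0.05, with no true effect, inflates the false positive rate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
alpha = 0.05
n_simulations = 1000
start_n, max_n, step = 10, 100, 10

hacked_significant = 0
for _ in range(n_simulations):
    a = list(rng.normal(size=start_n))
    b = list(rng.normal(size=start_n))   # identical populations: null is true
    while True:
        p = stats.ttest_ind(a, b).pvalue
        if p <= alpha or len(a) >= max_n:
            break
        a.extend(rng.normal(size=step))   # recruit 10 more per arm and retest
        b.extend(rng.normal(size=step))
    if p <= alpha:
        hacked_significant += 1

print(f"'Significant' findings with no real effect: "
      f"{hacked_significant / n_simulations:.0%}")
```

Even though no real effect exists, repeatedly peeking at the data and stopping as soon as P ≤ 0.05 yields "significant" findings far more often than the nominal 5%.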

The high fragility of many orthopaedic studies only serves as further evidence of p-hacking in the field of orthopaedic research. Defined as the number of outcome events or patients that must change to alter the significance of the results, fragility provides a measure of the quantitative stability of our findings. Parisien et al. screened comparative studies published in two sports medicine journals between 2006 and 2016. From the 102 studies included, 339 outcomes were identified, of which 98 were reported as significant and 241 as non-significant [10]. The fragility index, or the median number of events that would need to change to alter statistical significance, was determined to be 5 (IQR 3 to 8), while the average loss of patients to follow-up was greater at 7.9, challenging study reliability. Forrester et al. identified 128 surgical clinical studies in orthopaedic trauma comprising 545 outcomes. Again, the reported loss to follow-up exceeded the median fragility index (5) in over half of the included studies (53.3%) [11]. As surgeons, we affect patients directly, and it is problematic that much of our decision making is based on studies in which a crossover of fewer than 5 events can render the results statistically, and potentially clinically, insignificant.
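For readers who wish to see how a fragility index is obtained, the sketch below applies the common approach of flipping outcome events one at a time and recomputing Fisher's exact test until significance is lost; the event counts are hypothetical:

```python
# Minimal sketch (hypothetical counts): the fragility index is the number of
# outcome events that must change to move a result across P = 0.05.
from scipy import stats

def fragility_index(events_a, n_a, events_b, n_b, alpha=0.05):
    """Flip non-events to events in the arm with fewer events until p >= alpha."""
    baseline_p = stats.fisher_exact([[events_a, n_a - events_a],
                                     [events_b, n_b - events_b]])[1]
    if baseline_p >= alpha:
        return None  # only defined here for initially significant results
    flips = 0
    ea, eb = events_a, events_b
    while True:
        if ea <= eb:
            ea += 1   # add an event to the arm with fewer events
        else:
            eb += 1
        flips += 1
        p = stats.fisher_exact([[ea, n_a - ea], [eb, n_b - eb]])[1]
        if p >= alpha:
            return flips

# Hypothetical trial: 5/50 failures with treatment A vs 15/50 with treatment B.
print("Fragility index:", fragility_index(5, 50, 15, 50))
```

A fragility index of only a few events, especially when it is smaller than the number of patients lost to follow-up, is exactly the situation the studies above describe.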

A potential solution proposed by critics of the p value is the Bonferroni correction, which involves dividing the predefined significance level, alpha (0.05), by the number of tests being performed, where the tests are most commonly t tests or tests of Pearson's correlation coefficient (r) [12]. If 4 t tests were performed, the new alpha level would be 0.0125 for each test, so that the overall alpha level across all tests remains 0.05. This application of the Bonferroni correction addresses experimentwise error; the correction can also address familywise error, which arises when related groups are compared, for example through analysis of variance (ANOVA) [13].
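A minimal sketch of the correction in practice (the four p values below are hypothetical):

```python
# Minimal sketch: the Bonferroni correction divides alpha by the number of tests
# (equivalently, multiplies each p value by the number of tests, capped at 1).
alpha = 0.05
p_values = [0.030, 0.012, 0.048, 0.200]   # e.g., four t tests from one experiment

corrected_alpha = alpha / len(p_values)   # 0.05 / 4 = 0.0125 per test
for p in p_values:
    adjusted_p = min(p * len(p_values), 1.0)
    significant = p <= corrected_alpha
    print(f"raw p = {p:.3f}, Bonferroni-adjusted p = {adjusted_p:.3f}, "
          f"significant at overall alpha 0.05: {significant}")
```

Under the correction, only the test with raw P = 0.012 remains significant, even though three of the four raw p values fall below 0.05.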

It should be noted that the Bonferroni correction is not without criticism. For example, decreasing the frequency of Type I errors inherently increases the frequency of Type II errors (false negatives) [12, 13]. Moreover, the correction addresses a "universal" null hypothesis even though the groups being compared may not be identical across all comparisons [13]. On the whole, the Bonferroni correction is seen by some as statistically conservative, or in layman's terms, overly cautious. Other alternatives offered to the p value include the reporting of 95% confidence intervals or the Bayes factor. The most fervent p value critics suggest doing away with the p value entirely [15]. In addition, as treatment modalities grow in importance and study results become critical to patients' well-being, we may want to move towards more robust power analyses that select a lower alpha level (0.01 or 0.001), following in the footsteps of fields that seek to minimize Type I error.

Still, banishing the p value does not guarantee a future free from statistical misinterpretation. Researchers would be remiss to believe that the introduction of increasingly complex and arguably less-intuitive tests would help make statistical analysis more accessible throughout the research community. Moreover, the p value is deeply ingrained in the curriculum of courses spanning from high school introductory biology to graduate level biostatistics. To avoid the p value throughout the vast scientific community would be incredibly challenging and would certainly make information less accessible, both by requiring further statistical specialization and by leaving future researchers ill-equipped to analyze the incredible archive of currently published literature. We must instead promote a holistic view, in which foundational issues such as study design, clinical relevance vs statistical significance, fragility, and bias receive their due attention in addition to the p value.

Whether we like it or not, we have already reached a consensus on the use of the p value. It is our common statistical language, and rather than fumble with the implementation of an entirely new language, we need to double down on increasing statistical literacy surrounding the p value at every level of education. Bolstering the p value with corrections can help decrease the effects of p-hacking and false discovery rates, and should certainly remain a topic of discussion. However, the bottom line is that p values are not inherently evil. In fact, they have helped us identify a huge knowledge gap; many researchers lack the statistical knowledge to properly evaluate their results and communicate them. In that sense, our role is actually clear; we would do better to arm physicians and physician–scientists with comprehension and clarification, and that may be the easiest way to provide those around us with the tools to improve research quality.