Study interventions are often evaluated using hypothesis testing, and statistical analyses are performed to evaluate whether observed treatment effects are plausibly attributable to the intervention rather than to chance. The P value is the method most commonly used to summarize the statistical significance of results in research publications. In statistics, there are generally two hypotheses: the null hypothesis and the alternative hypothesis. The null hypothesis most commonly indicates no association between the factors investigated.1 A P value is a measure of the strength of evidence against the null hypothesis.2 Thus, for example, a P value of 0.2 implies that, if there is truly no difference in outcome between the factors examined, the probability of observing data as extreme as or more extreme than those seen is 20%. Fisher popularized the P value and suggested 0.05 as a threshold of significance, implying that if the P value is less than 0.05, there is evidence to reject the null hypothesis.3 Ever since, clinical researchers have maintained the conventional P value threshold of 0.05 to indicate a statistically significant result.4,5
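To make this interpretation concrete, the minimal sketch below (hypothetical data, not from any study discussed here) computes a two-sided P value for a two-group comparison using Welch's t-test and compares it against the conventional 0.05 threshold.

```python
# Minimal sketch: a two-sided P value for a hypothetical two-group comparison
# (Welch's t-test). The data are invented purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
control = rng.normal(loc=5.0, scale=2.0, size=40)       # hypothetical outcome, control group
intervention = rng.normal(loc=4.0, scale=2.0, size=40)  # hypothetical outcome, intervention group

t_stat, p_value = stats.ttest_ind(intervention, control, equal_var=False)

# Under the null hypothesis of no difference between groups, p_value is the
# probability of observing a test statistic at least this extreme.
print(f"t = {t_stat:.2f}, two-sided P = {p_value:.3f}")
print("Reject the null at 0.05" if p_value < 0.05 else "Do not reject the null at 0.05")
```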

Because of their widespread use in the medical literature, P values have far-reaching effects on clinical research and thus on clinical practice. Clinical practice is often based on guidelines that provide evidence-based treatment recommendations. For half a century, randomized controlled trials (RCTs) have been considered the gold standard of medical research, and they are the study design most commonly cited as supporting evidence in clinical practice guidelines.6,7 Randomized controlled trials often investigate treatment interventions, and P values are reported as evidence of intervention efficacy. Clinical practice relies on interventions found to be efficacious by way of a statistically significant difference between the intervention and standard-of-care groups. Thus, the use of the conventional 0.05 threshold as a bright-line criterion for statistical significance warrants investigation.

Despite the importance attached to statistical significance, some researchers suggest that P values are often misinterpreted, misrepresented, and over-relied upon.5,8 Many researchers argue that placing the P value threshold at 0.05 results in high rates of false positives.8,9,10 Others have highlighted the practice of “p-hacking,” described as the selective analysis or reporting of data until statistical significance is reached.11 This practice of outcome tampering by authors is likely a result of many journals’ tendency to favour publishing studies with statistically significant (“positive”) results over those with “negative” ones.11 A study by Greenberg et al. reported the association between rainy days and the Society for Pediatric Anesthesia’s (SPA) annual meeting as P = 0.006, a seemingly significant statistic.12 Nevertheless, the purpose of that study was not to highlight the SPA leadership’s “rainmaking abilities,” but rather to provide a striking illustration of common shortcomings associated with P values, such as publication bias, retrospective bias, and reproducibility bias.13

The risks of false positives and p-hacking, combined with an overreliance on P values by journals, have led to the development (not without controversy) of potential solutions to increase the validity of clinical findings. One proposal that many researchers have recently supported is to lower the P value threshold for statistical significance from 0.05 to 0.005, and to relabel P values between 0.005 and 0.05 as statistically “suggestive.”8,9,14,15 These researchers argue that lowering the threshold that defines statistical significance would improve reproducibility, reduce false positives, minimize the risk of p-hacking, and promote the conduct of more carefully designed studies with sufficient power. Moreover, it has been shown that P values offer little information regarding the reproducibility of results and need to be exceedingly small to approach a desirable chance of reproducibility.13 Cumming16 found that, for a two-sided result of P = 0.05, the probability of a replication achieving one-sided P < 0.05 would be 41%; at P < 0.005 it would be approximately 80%, and at P < 0.0001 nearly 95%. Thus, part of the rationale for adopting P < 0.005 as the significance threshold is to achieve a replication probability of approximately 80%.
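As a concrete illustration of the proposed relabelling, the short sketch below applies the three-way rule described above (significant below 0.005, “suggestive” between 0.005 and 0.05, nonsignificant at or above 0.05); the handling of exact boundary values is our assumption.

```python
# Sketch of the proposed relabelling rule: P < 0.005 is "statistically
# significant", 0.005 <= P < 0.05 is "suggestive", and P >= 0.05 is not
# statistically significant. Boundary handling is an assumption.
def classify_p(p: float) -> str:
    if p < 0.005:
        return "statistically significant"
    if p < 0.05:
        return "suggestive"
    return "not statistically significant"

for p in (0.0005, 0.004, 0.02, 0.049, 0.20):  # illustrative P values
    print(f"P = {p}: {classify_p(p)}")
```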

It has been hypothesized that shifting the P value threshold from 0.05 to 0.005 would reclassify about one-third of the statistically significant results in the published biomedical literature.8 In addition, the effect of lowering the P value threshold to < 0.005 has been investigated in general medicine as well as in other medical specialties.14,17,18,19 Nevertheless, the effect of such a change in anesthesiology is unknown. In a systematic review, Grolleau et al. calculated the fragility index of statistically significant results from RCTs in anesthesia and critical care.20 They found that RCT results are often “fragile,” suggesting a need to increase the strength and validity of trials in the anesthesia literature.

Given the influence of P values on anesthesia practice, we sought to investigate how lowering the P value threshold from 0.05 to 0.005 could affect the statistical significance of previously published RCTs in anesthesiology. We aimed to determine the percentage of primary endpoints reported in anesthesiology RCTs that would remain statistically significant, be reclassified as “suggestive,” or remain statistically nonsignificant.

Methods

This study was conducted and is reported in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses statement. Additionally, in an effort to promote transparency and reproducibility, our protocol, data extraction form, search string return, statistical analysis plan, and raw data have been uploaded to the Open Science Framework and made publicly available (https://osf.io/gj2w5/).21 Our protocol with detailed methodology was uploaded a priori on 9 December 2021. These documents were uploaded prior to initiating eligibility screening and data extraction, which began on 15 December 2021.

Search strategy

We used the PubMed database to search for RCTs and clinical trials electronically published (ePub) in 2020 in the three major general anesthesiology journals, as indexed by both Google Metrics22 and Scimago Journal & Country Rank.23 These journals are, in rank order, Anesthesiology, Anesthesia & Analgesia, and the British Journal of Anaesthesia. The following search string was used to retrieve articles: (“Anesthesiology”[Journal] OR “Anesthesia and analgesia”[Journal] OR “British journal of anaesthesia”[Journal]) AND ((clinicaltrial[Filter] OR randomized controlled trial[Filter]) AND (2020:2020[pdat])).
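For readers who wish to reproduce the retrieval step programmatically, the sketch below submits the same search string through NCBI’s E-utilities using Biopython. This is an illustration rather than the workflow used in this study; the e-mail address is a placeholder, and filter behaviour through the API may differ slightly from the PubMed web interface.

```python
# Illustrative only (not the workflow used in this study): submitting the same
# PubMed search string through NCBI's E-utilities with Biopython.
from Bio import Entrez

Entrez.email = "your.name@example.org"  # placeholder; NCBI requires a contact e-mail

query = (
    '("Anesthesiology"[Journal] OR "Anesthesia and analgesia"[Journal] '
    'OR "British journal of anaesthesia"[Journal]) '
    'AND ((clinicaltrial[Filter] OR randomized controlled trial[Filter]) '
    'AND (2020:2020[pdat]))'
)

handle = Entrez.esearch(db="pubmed", term=query, retmax=200)
record = Entrez.read(handle)
handle.close()

print("Records found:", record["Count"])
print("First PMIDs:", record["IdList"][:10])
```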

Inclusion criteria and exclusion criteria

To be included in our study, articles had to be RCTs; be published in 2020 in Anesthesiology, Anesthesia & Analgesia, or the British Journal of Anaesthesia; state a primary endpoint; and use a P value to determine the effect of the intervention. All studies that did not specifically state a primary endpoint or that did not provide a P value were excluded.

Screening process for eligible randomized controlled trials

Following the literature search, the returned records were uploaded to Rayyan (Qatar Computing Research Institute, Doha, Qatar),24 a platform for screening and selecting studies for systematic literature reviews. Title and abstract screening was conducted by two investigators (P. W. and M. L.) in a masked, duplicate manner. Following the initial screening, any disagreements were settled through discussion, with a third investigator (B. R.) available for arbitration. For the studies not included in the final analysis, reasons for exclusion are detailed in Electronic Supplementary Material (ESM) eTable 1.

Data collection process

Two investigators (P. W. and M. L.) carried out masked, duplicate data extraction using a Google Form. A third independent investigator (B. R.) was available for arbitration. In addition to the P value of the primary endpoints, the following study characteristics were extracted from each trial: article title, journal name, funding source, sample size, type of endpoint (subjective or objective), type of intervention, name of intervention, and setting (single institution, multicentred, etc.). The P value for each included study is provided in ESM eTable 1. Regarding adjudication of the type of endpoint, a subjective endpoint was defined as an outcome based on or influenced by the patient’s thoughts, feelings, opinions, or interpretation (e.g., patient-reported outcomes, pain). In contrast, an objective endpoint was defined as an outcome not based on or influenced by the patient’s thoughts, feelings, opinions, or interpretation (e.g., laboratory values, imaging, physical examination). The primary endpoint from each study and its designation as objective or subjective are provided in ESM eTable 2.

Statistical analysis

We identified the proportion of endpoints that retained statistical significance at P < 0.005 and the proportion that would be redefined as “suggestive.” We then applied a binary logistic regression model to evaluate whether particular study characteristics were associated with maintaining significance at the lower threshold. We used logistic regression so that the estimated associations could be generalized to the anesthesiology literature beyond our sample of studies. Thus, for each logistic regression analysis, the study characteristic was the independent variable and whether the P value maintained significance under the proposed threshold was the dependent variable. All logistic regression analyses were performed using Stata version 15.1 (StataCorp LLC, College Station, TX, USA); no automated variable selection algorithm was used. We also calculated the false discovery rate (FDR) using the Benjamini–Hochberg method in Microsoft® Excel (Microsoft Corporation, Redmond, WA, USA). This methodology was adopted from a study previously published in the Journal of the American Medical Association (JAMA) that assessed the effects of reducing the P value threshold to < 0.005 in three major general medical journals.17 We collapsed categories of the funding variable because of the small sample sizes in some categories.
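As an illustration of these two analysis steps (the analyses in this study were performed in Stata and Excel, not Python), the sketch below fits one logistic regression per study characteristic on invented data and then applies the Benjamini–Hochberg adjustment to the resulting P values; the variable names are hypothetical.

```python
# Illustrative sketch of the two analysis steps on invented data (the analyses
# in this study were performed in Stata and Excel). Outcome: whether a P value
# that was < 0.05 also maintained significance at < 0.005.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
n = 58  # endpoints significant at the conventional 0.05 threshold
df = pd.DataFrame({
    "maintained": rng.integers(0, 2, n),         # 1 = still significant at < 0.005
    "single_centre": rng.integers(0, 2, n),      # hypothetical binary study characteristic
    "log_sample_size": np.log(rng.integers(20, 500, n)),
})

predictors = ["single_centre", "log_sample_size"]
raw_p = []
for predictor in predictors:
    X = sm.add_constant(df[[predictor]])             # one characteristic per model
    fit = sm.Logit(df["maintained"], X).fit(disp=0)  # binary logistic regression
    raw_p.append(fit.pvalues[predictor])

# Benjamini-Hochberg false discovery rate adjustment across the regressions
reject, adjusted_p, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")
for name, p, q, r in zip(predictors, raw_p, adjusted_p, reject):
    print(f"{name}: raw P = {p:.3f}, BH-adjusted P = {q:.3f}, significant = {r}")
```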

Results

The literature search returned 134 articles, 27 of which were excluded after title and abstract screening. An additional 16 articles were excluded during full-text screening and data extraction. Our final sample included 91 RCTs. A screening flow diagram documenting the reasons for exclusion is presented in the Figure.

Figure: Flow diagram of study inclusion

Characteristics of included randomized controlled trials

Of the 91 RCTs, 37 (40%) were published in the British Journal of Anaesthesia, 27 (30%) in Anesthesiology, and 27 (30%) in Anesthesia & Analgesia (Table 1). Objective endpoints were the most common type of primary endpoint (60/91, 66%). The most frequently studied type of intervention was drugs (44/91, 48%). Of the studies evaluating drugs, 19 related to pain, 14 to general anesthesia, ten to cardiovascular anesthesia, and one to obstetric anesthesia. Additionally, of the same 44 studies, 14 were comparative studies, four were dose-finding studies, 19 examined adverse outcomes of a specific drug, and seven examined physiologic effects of a specific drug. Most studies were conducted at a single centre (65/91, 71%). Of the RCTs that included a funding statement (n = 90), 39% (35/90) reported receiving internal hospital/university funding. Few studies were supported by nonprofit (2/90, 2%) or industry/private (8/90, 9%) sources. The sample sizes of the RCTs ranged from ten to 10,010 participants, with a median of 105.

Table 1 Statistical characteristics of primary endpoints

Primary endpoint analysis

Because some trials reported multiple primary outcomes, and thus multiple P values, a total of 99 primary endpoints were included in our analysis. A total of 58 (59%) endpoints had a P value < 0.05 and 41 (41%) had a P value ≥ 0.05. Of the 58 P values < 0.05, 21 (36%) would maintain statistical significance under the proposed threshold of 0.005. Of the 58 primary endpoints previously considered statistically significant, 37 (64%) would be reclassified as “suggestive.” Of these 37 endpoints, 27% (10/37) were published in Anesthesiology, 38% (14/37) in the British Journal of Anaesthesia, and 35% (13/37) in Anesthesia & Analgesia. In the covariate-adjusted analysis (odds ratios with 95% confidence intervals), RCTs evaluating devices or “other” interventions, as well as those with industry/private or multiple funding sources, were more likely to report outcomes that maintained statistical significance (Table 2). Nevertheless, after adjusting the P values for the FDR using the Benjamini–Hochberg method, no study characteristic remained associated with maintained significance (Table 2).

Table 2 Analysis of trial characteristics and reporting a P value less than 0.005

Discussion

We found that nearly four out of five primary endpoints from the RCTs included in our study would not reach statistical significance under the proposed threshold of 0.005, and that nearly two of every three primary endpoints that were previously statistically significant would be relabelled as “suggestive.” These results have important implications for anesthesiologists who draw conclusions about an RCT’s findings from the reported P value.

Our findings suggest a potential need to improve how we interpret the strength of evidence for intervention efficacy. Lowering the P value threshold to < 0.005 may serve as a temporizing mechanism until other solutions can be evaluated and implemented. For one, “p-hacking” is reported to be a widespread practice among clinical trialists and data analysts. Lowering the threshold at which a study may be considered significant would make p-hacking more difficult, likely reducing its use. Furthermore, a P value of 0.05 can lead to false positive rates as high as 30%.25 Therefore, in the clinical setting, critical treatment decisions based on RCTs whose statistical significance is defined at P < 0.05 could have negative consequences. Statistical significance is driven by many factors. For example, a large enough sample size will almost always yield statistically significant results.26 Additionally, given their nature, subjective outcomes suffer from poorer reliability and greater recall and reporting bias than their objective counterparts.27 These factors may be especially relevant in anesthesia and other surgical specialties, as large studies and objective outcomes are less common than in general medicine trials. Nevertheless, we found no association between sample size or type of outcome and the maintenance of statistical significance (Table 2). Thus, in our sample, the effect of lowering the P value threshold did not appear to depend on sample size.
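For context, the sketch below shows one common way such false positive figures can arise, combining the significance threshold with assumed values for statistical power and the prior probability that a tested effect is real; the inputs are illustrative assumptions and are not taken from reference 25.

```python
# Illustrative arithmetic only (assumed inputs, not taken from reference 25):
# among findings declared "significant", the long-run false positive rate
# depends on the threshold, statistical power, and the prior probability
# that the tested effect is real.
def false_positive_rate(alpha: float, power: float, prior: float) -> float:
    false_pos = alpha * (1 - prior)   # true nulls that cross the threshold
    true_pos = power * prior          # real effects that cross the threshold
    return false_pos / (false_pos + true_pos)

for alpha in (0.05, 0.005):
    rate = false_positive_rate(alpha=alpha, power=0.8, prior=0.1)
    print(f"alpha = {alpha}: ~{rate:.0%} of 'significant' results are false positives")
# With these assumed inputs: alpha = 0.05 -> ~36%; alpha = 0.005 -> ~5%
```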

With nearly two-thirds of the previously statistically significant anesthesia trial endpoints in our study shifting to the “suggestive” category, our results exceed Ioannidis’s prediction that adjusting the P value threshold would shift one-third of past biomedical literature to “suggestive.”8 The results of our study nonetheless parallel those from studies in other medical specialties. Bruno et al. analyzed 202 RCTs published in obstetrics and gynecology journals between 2017 and 2019.28 Of the 90 RCTs with statistically significant outcomes (P < 0.05), nearly half would be relabelled as “suggestive.” A study evaluating RCTs in orthopedic sports medicine reported that over half of statistically significant primary outcomes would be reclassified as “suggestive.”14 Whereas these studies evaluated particular clinical specialties, a study of clinical trials published in three high impact factor general medical journals found that only 29.3% of statistically significant P values would be reclassified as “suggestive.”17 Our study therefore indicates that fewer primary outcomes in anesthesiology RCTs would maintain statistical significance compared with trials published in high impact factor journals in other specialties. Our findings suggest that the clinical application of evidence-based medicine within anesthesiology would shift considerably under the proposed lowering of the P value threshold. As medical advancements continue to rely on the proper understanding of peer-reviewed medical literature, the classification and interpretation of clinical findings are becoming more crucial. Furthermore, as surgical procedure rates continue to rise,29 an evidence-based approach becomes even more necessary for practicing anesthesiologists to provide the most beneficial and proven therapeutic strategies.

Study limitations

This study is strengthened by sound methodology, including masked, duplicate screening and extraction, the gold standard method established by the Cochrane Collaboration.30 Nevertheless, we acknowledge that this study is not without limitations. One limitation is that only the three highest impact general anesthesiology journals were included, over a one-year period. Therefore, the results may not be generalizable to all anesthesiology RCTs. In addition, because we conducted our search through PubMed, which draws primarily on the MEDLINE database, our search may not have returned all studies published in these three journals.

Conclusion

Overall, our study found that lowering the P value threshold from 0.05 to 0.005 would alter the statistical significance of over one-third of the primary endpoints in RCTs published in major anesthesiology journals. Thus, it is critical that readers consider post hoc probabilities, as well as other contributing factors, when evaluating and interpreting clinical trial results. Reducing the P value threshold and applying a reclassification of “statistically suggestive” may serve as an interim means of reducing the misinterpretation of clinical trial results. Reliable, evidence-based literature is necessary for anesthesiologists to provide the most informed and skillful care to patients. Although our study provides substantial findings regarding a change to the P value threshold in the anesthesiology literature, we suggest that future studies further explore this proposal in other fields of medicine to avoid clinical misinterpretation of RCT findings and improve quality of care.