Physicians are faced with the challenge of assessing whether the conclusions of research studies are valid. Power, the probability that a study will detect an effect of a specified size, is analogous to the sensitivity of a diagnostic test. [1] Just as a negative result does not rule out disease when the test applied has low sensitivity, a negative study with inadequate power cannot disprove a research hypothesis. Power/sample size calculations play an important role in study planning, give readers an idea of the adequacy of the investigation, and help readers assess the validity of studies with negative results. [2–4] Effect size (delta) is a critical component of power calculations. Investigators choose from a wide range of possible deltas when calculating sample size. Clinicians and investigators also often struggle to determine what effect size is reasonable to expect.[2, 5–8] Consequently, it is important for investigators to report the effect size they wish to detect. However, this is often neglected.[8]
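To make the role of delta concrete, the following minimal sketch estimates the sample size per group needed to compare two proportions, using the standard normal-approximation formula. It is an illustration only and is not taken from any of the reviewed articles; the function name `sample_size_two_proportions`, the control-group event rate of 0.30, and the deltas of 0.10 and 0.05 are hypothetical values chosen for the example.

```python
import math
from statistics import NormalDist


def sample_size_two_proportions(p_control, delta, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided test of two proportions.

    p_control and delta are hypothetical planning inputs: the expected event
    rate in the control group and the absolute difference to be detected.
    """
    p_treat = p_control + delta
    if delta == 0 or not 0 < p_control < 1 or not 0 < p_treat < 1:
        raise ValueError("proportions must lie in (0, 1) and delta must be nonzero")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile corresponding to the power

    p_bar = (p_control + p_treat) / 2              # pooled proportion under the null
    term_null = z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
    term_alt = z_beta * math.sqrt(
        p_control * (1 - p_control) + p_treat * (1 - p_treat)
    )
    return math.ceil((term_null + term_alt) ** 2 / delta ** 2)


# Hypothetical example: halving delta increases the required sample size severalfold.
n_large_delta = sample_size_two_proportions(0.30, 0.10)
n_small_delta = sample_size_two_proportions(0.30, 0.05)
print(n_large_delta, n_small_delta)
```

Because the required sample size grows roughly with the inverse square of delta, a reader who is not told which delta was assumed cannot judge whether a negative study was adequately sized.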

Sample size calculations alone are insufficient for the interpretation of studies with negative results; power and confidence intervals complement each other and should both be reported.[6, 9] Confidence intervals take into account the data actually collected, define the upper and lower range consistent with a study's data, provide an estimate of precision, and can give readers some indication of the clinical significance of the results. [10–13]

Our work adds to the literature in several ways. Several authors have found that many randomized controlled trials were underpowered, or had an unacceptable risk of missing an important effect due to inadequate sample size. [14–21] Because power calculations are often complicated,[21] many readers are unlikely to have the statistical sophistication necessary to perform a power analysis. Therefore, we were interested in whether articles provided information necessary for readers to assess the validity of studies with negative results. We looked for evidence of power/sample size calculations and effect size. In addition, unlike prior work, we examined studies for documentation of confidence intervals.[22] Finally, because the calculation of sample size is applicable to all comparative studies, we did not limit our study to randomized controlled trials.[23]

Our primary objective was to quantify the proportion of studies with negative results within prominent general medical journals[24] that comment on power and present confidence intervals. Secondary outcomes were to quantify the proportion of these studies with a specified delta and a defined primary outcome.


All articles from the 1997 issues of the British Medical Journal (BMJ), Journal of the American Medical Association (JAMA), Lancet, and the New England Journal of Medicine (NEJM) were reviewed. Because the Annals of Internal Medicine (Annals) is published bimonthly, all articles from 1997 and 1998 were reviewed so as to include a comparable number of articles. One investigator (RSH) manually searched the journals and reviewed all articles for eligibility. Review articles, meta-analyses, modeling studies, decision and cost-effectiveness analyses, case reports, editorials, letters, and studies without inferential statistics (i.e., descriptive studies) were excluded. Equivalence trials (studies designed to show equivalent efficacy of treatments) were included because power analysis, confidence intervals, and delta are particularly important to their design. Methodological issues involved in the design and analysis of these studies have been described elsewhere.[25, 26]

Articles were classified as having negative results if 1) the primary outcome(s) was not statistically significant (i.e., the article had an explicit statement that the comparison between two groups did not reach statistical significance) or 2) in those articles with no primary outcome(s), any of the first three outcomes were not statistically significant. Other outcomes were not evaluated. A second author (TAE) reviewed the full text of a simple random sample of 50 articles, and the kappa statistic was calculated to assess the interobserver variability of our classification scheme.
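As an illustration of the kappa statistic, the sketch below computes Cohen's kappa for two reviewers who each classify articles as negative or not negative. The 2 x 2 counts are hypothetical and were chosen only so that they sum to 50 articles; they are not the study's data, and the function name `cohens_kappa` is an assumption of this example.

```python
def cohens_kappa(both_negative, only_first_negative, only_second_negative, both_positive):
    """Cohen's kappa for two raters and a binary classification.

    Arguments are the four cells of the 2 x 2 agreement table.
    """
    n = both_negative + only_first_negative + only_second_negative + both_positive
    if n == 0:
        raise ValueError("agreement table is empty")

    # Observed agreement: proportion of articles on which the raters agree.
    observed = (both_negative + both_positive) / n

    # Marginal proportions of "negative" calls for each rater.
    first_negative = (both_negative + only_first_negative) / n
    second_negative = (both_negative + only_second_negative) / n

    # Agreement expected by chance from the marginals alone.
    expected = (first_negative * second_negative
                + (1 - first_negative) * (1 - second_negative))
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


# Hypothetical counts for a sample of 50 articles (not the study's data).
kappa = cohens_kappa(both_negative=9, only_first_negative=2,
                     only_second_negative=3, both_positive=36)
print(round(kappa, 2))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters here because most articles are not negative and two reviewers would often agree by chance alone.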

We examined articles to see if the authors named a primary outcome variable. We employed a decision rule, modified from Moher and colleagues, to define the primary outcome in those articles where none was specified.[19] If an article reported a sample size calculation, the outcome used in that calculation was assumed to be the primary outcome.[27] If calculations were not performed, a total of three outcomes, if present, were examined. In those articles with multiple outcomes and none defined as primary, the three outcomes evaluated were the first three listed in the abstract (or in the results section if fewer than three outcomes were listed in the abstract).

The full text of included articles was systematically reviewed. Data were abstracted by a single author (RSH) and recorded in standardized fashion. Information was recorded on whether the article had a primary outcome(s), commented on power, sample size calculations, and confidence intervals pertaining to the outcomes evaluated, a projected delta, and a reason for this delta. A paper was given credit for addressing power if sample size calculations or comments on power/sample size were present. Power, sample size calculations, and confidence intervals could pertain to any one of the three outcomes evaluated and were not required for all outcomes.

Comparisons were made across journals by chi-square analysis. We also assessed articles for comment on power and/or presentation of confidence intervals while stratifying by study design (clinical trials, observational studies of etiology/risk factors, screening/diagnosis, prognosis, and other). Responses were summarized as proportions and 95% confidence intervals. All data were analyzed using STATA 6.0 (Stata Corp., College Station, TX).


One thousand thirty-eight articles were eligible for analysis. Two hundred thirty-four (23%) were classified as negative. There was good agreement between observers in the classification of articles (κ = 0.74). The percentage of negative articles per journal was: Annals 20% (41/203), BMJ 22% (57/256), JAMA 23% (44/191), Lancet 22% (46/205), and NEJM 25% (46/183) (p = 0.857).
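The comparison across journals can be sketched as a Pearson chi-square test on a 5 x 2 table built from the counts above (negative versus non-negative articles per journal). This is an illustrative reconstruction, not the authors' Stata analysis, so it need not reproduce the reported p value to the last digit. The function names are assumptions of this example, and the p value uses the closed-form tail probability that holds only for an even number of degrees of freedom.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, df


def chi_square_p_value_even_df(statistic, df):
    """Upper-tail probability of a chi-square variable; valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive, even df")
    half = statistic / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))


# Negative and eligible article counts per journal, as reported in the text.
counts = {"Annals": (41, 203), "BMJ": (57, 256), "JAMA": (44, 191),
          "Lancet": (46, 205), "NEJM": (46, 183)}
table = [[neg, total - neg] for neg, total in counts.values()]

statistic, df = chi_square_independence(table)
p_value = chi_square_p_value_even_df(statistic, df)
print(round(statistic, 2), df, round(p_value, 2))
```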

Thirty percent (70/234) of studies with negative results had comments on power and/or sample size calculations. Seventy-three percent (171/234) included confidence intervals. Twenty-two percent of the studies included both power/sample size calculations and confidence intervals. The reporting of power/sample size calculations (range: 15–52%) and confidence intervals (range: 55–81%) varied significantly among journals (Table 1). Because clinical trials (n = 87) and observational studies of etiology/risk factors (n = 109) were the predominant study designs (84% of the negative studies), articles with other study designs were not examined further. Fifty-six percent (95% CI, 46–67%) of negative clinical trials and 15% (95% CI, 8–21%) of negative observational risk factor/etiology studies addressed power/sample size (p < 0.001). For reporting confidence intervals, the corresponding percentages were 79% (95% CI, 71–87%) and 75% (95% CI, 65–84%), respectively (p = 0.489).
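For readers who wish to attach a confidence interval to the overall proportions above, the following sketch computes a proportion with a normal-approximation (Wald) 95% confidence interval. The text does not state which interval method was used, so the Wald method is an assumption of this example, as is the function name `wald_interval`.

```python
import math
from statistics import NormalDist


def wald_interval(successes, n, confidence=0.95):
    """Proportion with a normal-approximation (Wald) confidence interval."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    proportion = successes / n
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * math.sqrt(proportion * (1 - proportion) / n)
    # Clip to [0, 1] because the approximation can overshoot near the bounds.
    return proportion, max(0.0, proportion - half_width), min(1.0, proportion + half_width)


# Counts taken from the text: negative studies addressing power, and
# negative studies reporting confidence intervals, out of 234.
for label, count in (("power/sample size", 70), ("confidence intervals", 171)):
    p, low, high = wald_interval(count, 234)
    print(f"{label}: {p:.0%} (95% CI, {low:.0%} to {high:.0%})")
```

The Wald interval is adequate for proportions of this size and sample; for proportions near 0 or 1, or for small samples, an exact or Wilson interval would be a more defensible choice.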

Table 1 Negative articles addressing power/sample size and confidence intervals

Of the negative articles including information about sample size, 87% (61/70) specified a delta or the effect size that the authors sought to detect. A minority, 43% (26/61), explained the rationale behind the delta chosen. Of these, 77% (20/26) cited references or pilot studies to support their rationale.

Only 52% (122/234) of articles with negative results had a clearly defined primary outcome(s).


Many articles underreport power/sample size calculations and confidence intervals. Significant variation exists among journals. Our work demonstrates that power was reported more often in clinical trials than in observational studies of etiology/risk factors. Investigators involved in randomized clinical trials may be more familiar with the importance of power and sample size calculation.[28] Also, investigators conducting observational studies often do not have the ability to determine sample size prior to beginning their work. Most articles with sample size calculations reported a projected effect size, but only a minority shared the rationale behind this delta, and even fewer provided empiric evidence to support the rationale.

While this manuscript describes an analysis of a large body of studies with negative results, several limitations must be considered. First, although most negative studies did not report power/sample size calculations, we cannot be certain that these were not performed a priori. It is possible that, for the sake of brevity, authors and/or editors omitted power/sample size calculations from the final text when preparing manuscripts for submission. However, it is also possible that unreported calculations were never performed.[29] Second, our definition of a negative study may seem unduly broad. We examined three outcomes in order to classify articles because articles frequently report several outcomes, often with none defined as primary. [30–33] Previous authors who limited their work to randomized controlled trials and encountered multiple outcomes have defined the primary outcome as "the most clinically important"[19] or the outcome that was the "primary focus of the article".[20] These outcomes are often not possible to discern in observational studies. Nonetheless, our results may represent a best-case scenario given the publication bias against articles with negative results and the fact that we examined the more prominent general medical journals.[34]


In summary, this study demonstrates that prominent medical journals often provide insufficient information to assess the validity of studies with negative results. Authors and journal editors need to include this information so readers can be informed consumers of the medical literature.