Rigorously conducted systematic reviews based on high-quality studies provide strong evidence for decision-making by patients, clinicians, and policy makers. However, there are a number of biases that can produce inaccurate or misleading results.1 These can lead to inappropriate treatment and misguided policy. Two common biases that can distort systematic reviews and meta-analyses are publication bias and selective outcome reporting.1, 2

Studies with small or non-significant treatment effects are more likely to go unpublished or to experience publication delay.3,4,5 Thus, published articles may report biased estimates that overstate benefit. Similarly, outcome reporting bias (ORB) may occur when study authors selectively report outcomes, often based on statistical significance.2 While publication bias occurs when an entire study goes unpublished, selective outcome reporting occurs when trials collect a number of outcomes and, post hoc, publish only selected ones.6,7,8 Both biases result in non-significant outcomes being omitted from meta-analyses. Comparisons of published papers with their original protocols have found that ORB is common: statistically significant results are more likely to be published, protocol primary outcomes are often changed, and most trials underreport their outcomes.6,7,8 Comparisons of data submitted for drug approval with data reported in subsequent publications also provide evidence of ORB and biased meta-analyses.9 Several tests assess publication bias (also referred to as small-study effects), including funnel plots,10, 11 nonparametric tests,2 and regression analyses.3, 4, 12 There are also approaches to adjust pooled estimates when publication bias is suspected.10, 11

While the Cochrane Risk of Bias tool includes a question to assess risk of ORB, this assessment has been shown to be insensitive.13 The Outcome Reporting Bias in Trials (ORBIT) study proposed a tabular approach to assess for the presence of outcome reporting bias.14 In this method, missing outcomes are stratified as being at high or low risk of bias based on reviewer judgment or on information obtained by contacting authors.

ORB is essentially a missing data problem. There are three patterns of missingness: missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR).15 MCAR occurs if the reason data are missing is unrelated to any of the data values, whether missing or observed; the missing data are a random subset of the full data. An example is an audio recording of insufficient quality to code. MAR occurs if the reason the data are missing is unrelated to the missing values themselves but is related to (conditional on) the observed data. If one conditions on the observed variables, the remaining missingness is independent of the missing values. An example would be residents missing an in-service examination because they are ill: while one may be able to predict missing the assessment from data on their health, illness would not have been related to their test scores had they been present. Data are NMAR if there is an association between the value of the missing data and its likelihood of being missing (e.g., in studies on depression incidence, patients who develop depression are less likely to complete follow-up surveys). Compared to MCAR and MAR, NMAR is more challenging to address because any method requires assumptions that are untestable with the observed data.
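
To make these mechanisms concrete, the toy sketch below (Python; all variables, probabilities, and thresholds are hypothetical, not from this study) generates a value of interest and an observed covariate, then induces each missingness pattern:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
health = rng.normal(size=n)   # observed covariate (e.g., illness severity)
score = rng.normal(size=n)    # value of interest (e.g., test score)

# MCAR: missingness is independent of all data values.
mcar = rng.random(n) < 0.2

# MAR: missingness depends only on the observed covariate.
mar = rng.random(n) < np.where(health < -1, 0.5, 0.05)

# NMAR: missingness depends on the (unobserved) value itself.
nmar = rng.random(n) < np.where(score < -1, 0.5, 0.05)

score_mar = np.where(mar, np.nan, score)   # and likewise for mcar, nmar
```

Conditioning on `health` removes the association between `score` and missingness under MAR; under NMAR, no observed variable can do so.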

The best approach for correcting ORB is to obtain the unreported data from the authors. However, trials may have been conducted decades ago, authors may be difficult to locate, and they may no longer have, or may be unwilling to share, the data. Most systematic reviews have little success obtaining unreported data from authors.16 Thus, statistical methods to adjust for ORB have been proposed; we focus on three approaches.17,18,19 The Copas method uses the ORBIT risk classification system to stratify the risk from ORB as high or low. For outcomes at high risk of bias, the Copas method uses a maximum likelihood approach to adjust the pooled outcome, accounting for the relative sample size of the studies with missing outcome data.20 Copas calculates a maximum bias bound, based on a worst-case scenario for the missing data, and adds this value to the pooled effect, effectively moving the estimate closer to the null. The Copas method does not use the other observed outcomes in the adjustment. A web-based tool implements this adjustment but is at present available only for binary outcomes.21

The Frosi method uses multivariate modeling to jointly synthesize multiple correlated outcomes. This approach imputes unreported outcomes from data reported in other studies, based on estimated within-study correlations.18 Unfortunately, the iterative algorithm may not converge and thus may produce no usable result; this commonly occurs when there are few observations. The method assumes a correlation between the different reported outcomes within studies. Therefore, if other outcomes are also selectively reported, the Frosi adjustment will perpetuate ORB. Both the Copas and Frosi methods are computationally complex and are not provided by meta-analytic packages. Consequently, few systematic reviews explore the impact of potential ORB in reporting their results. A third, simpler, approach is based on the trim and fill method, which was created to adjust for publication bias and is included in most meta-analytic packages. Trim and fill adjusts for small-study effects and is based on a nonparametric test of asymmetry of the funnel plot (effect size plotted against study variance).22

This study had two purposes: (1) to propose a test to assess the likelihood of ORB and (2) to propose a new, simple method to assess and adjust pooled outcomes when there is evidence of ORB. In addition, we compared the proposed method with previously reported ORB adjustment methods and tested all of the methods using simulation.

METHODS

We use data from our review on the efficacy of beta-blockers on migraine headache prophylaxis.23 We included published, randomized, placebo-controlled clinical trials of at least 4 weeks’ duration that evaluated prophylactic treatment of episodic migraines with beta-blockers in adults. Abstracted outcomes included the following: (1) headache frequency, (2) headache days, (3) headache index (which involved combinations of frequency, severity, and duration), (4) severity, (5) duration, (6) analgesic medication use, (7) health-related quality of life, (8) 50% headache reduction, and (9) work absence. We used headache frequency as the primary outcome in our original meta-analysis and in this study because it was the most frequently reported outcome and is the headache outcome recommended by the International Headache Society.24

Assessment for Risk of Bias

We abstracted whether these potential outcomes were reported in manuscripts or protocols as having been collected. We also abstracted whether these outcomes were reported to be statistically significant (including for outcomes without other reported data). We assessed the likelihood of ORB using the ORBIT approach (Appendix Table 1).20 We pooled data using a random-effects model with a restricted maximum likelihood (REML) estimator based on a marginal normal distribution to estimate between-study variances.25, 26 All analyses were conducted using STATA (v 16.1, College Station, TX).
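
Pooling was performed with STATA's meta commands; purely as an illustration of the underlying computation, the sketch below (Python, not the software used in the study) estimates the between-study variance τ² by numerically maximizing the restricted log-likelihood and returns the pooled effect:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def reml_pool(y, v):
    """Random-effects pooling with a REML estimate of the
    between-study variance tau^2.

    y : study effect estimates
    v : within-study variances
    """
    y, v = np.asarray(y, float), np.asarray(v, float)

    def neg_restricted_ll(tau2):
        w = 1.0 / (v + tau2)                  # inverse-variance weights
        mu = np.sum(w * y) / np.sum(w)        # weighted pooled mean
        # negative restricted log-likelihood (constants dropped)
        return 0.5 * (np.sum(np.log(v + tau2))
                      + np.log(np.sum(w))
                      + np.sum(w * (y - mu) ** 2))

    tau2 = minimize_scalar(neg_restricted_ll,
                           bounds=(0.0, 10.0 * max(np.var(y), 1e-8)),
                           method="bounded").x
    w = 1.0 / (v + tau2)
    mu = np.sum(w * y) / np.sum(w)
    se = np.sqrt(1.0 / np.sum(w))
    return mu, se, tau2

# Usage: mu, se, tau2 = reml_pool(effects, variances)
# 95% CI: (mu - 1.96 * se, mu + 1.96 * se)
```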

We categorized each of the nine headache variables as (1) data provided, statistically significant (p < 0.05); (2) data provided, statistically non-significant; (3) data not provided, reported to be statistically significant; (4) data not provided, reported to be statistically non-significant; or (5) data and statistical significance not reported. We compared these categories using contingency tables and calculated the relative risk of publishing significant versus non-significant findings. Outcomes that were reported as collected, but for which no further information was available, were assumed to be statistically non-significant, a common assumption.17,18,19 We also assessed the sensitivity of a formal test of the MCAR assumption, Little's test, a chi-squared test that accommodates arbitrary missing-value patterns.27
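
To illustrate the contingency-table comparison, a minimal sketch (Python; the counts below are placeholders, not this study's data) computes the relative risk of significance for reported versus unreported outcomes:

```python
import numpy as np

def significance_rr(rep_sig, rep_nonsig, unrep_sig, unrep_nonsig):
    """Relative risk that an outcome is statistically significant when
    reported versus when collected but unreported, with a 95% CI from
    the usual log-scale standard error."""
    a, b, c, d = rep_sig, rep_nonsig, unrep_sig, unrep_nonsig
    p_rep = a / (a + b)          # P(significant | reported)
    p_unrep = c / (c + d)        # P(significant | unreported)
    rr = p_rep / p_unrep
    se_log = np.sqrt(1/a - 1/(a + b) + 1/c - 1/(c + d))
    lo, hi = rr * np.exp(np.array([-1.96, 1.96]) * se_log)
    return rr, (lo, hi)

# Placeholder counts (not the study's data); unreported outcomes of
# unknown significance are counted as non-significant, per the
# assumption described above.
print(significance_rr(30, 20, 5, 25))
```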

Adjustment for Outcome Reporting Bias

We created and tested a simple model to adjust for the potential effect of unreported outcomes. In this model, we assumed that all unpublished data have a null effect (θi = 0), whether expressed as standardized mean differences (SMDs), weighted mean differences (MDs), or log odds ratios (ORs). We then assessed the impact of including these missing data on the pooled outcome. To “fill in” missing outcomes, we also needed to estimate the standard deviations (SDs) for the missing data; we tested two approaches. First, for each outcome we fit a linear regression model to the observed SDs, with sample size and effect size as independent variables,28 and imputed the predicted value from this model for any missing SDs.15 Second, we imputed missing SDs using the multiple imputation (MI) command in STATA, with additional variables (e.g., year, country). We estimated the pooled effects including these “filled-in” estimates using a REML model (meta command, STATA).
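
Below is a minimal sketch of this fill-in procedure (Python rather than the STATA workflow used in the study; the two-arm variance formula is a simplifying assumption, and `reml_pool` is the function from the earlier sketch):

```python
import numpy as np

def fill_in_adjust(y, sd, n):
    """Fill-in adjustment sketch: null effects and regression-imputed
    SDs for unreported outcomes, then re-pool everything with REML.

    y  : effect estimates, with np.nan marking unreported outcomes
    sd : standard deviations, np.nan where missing
    n  : per-arm sample sizes
    """
    y, sd, n = (np.asarray(a, float) for a in (y, sd, n))
    miss = np.isnan(y)

    # 1) Assume unreported outcomes have a null effect (theta_i = 0).
    y_filled = np.where(miss, 0.0, y)

    # 2) Impute missing SDs from a least-squares regression of the
    #    observed SDs on sample size and effect size.
    X = np.column_stack([np.ones_like(n), n, y_filled])
    beta, *_ = np.linalg.lstsq(X[~miss], sd[~miss], rcond=None)
    sd_filled = np.where(np.isnan(sd), X @ beta, sd)

    # 3) Re-pool reported and filled-in outcomes together.
    # Simplifying assumption: variance of a two-arm mean difference
    # with n participants per arm.
    v = 2.0 * sd_filled ** 2 / n
    return reml_pool(y_filled, v)   # reml_pool from the earlier sketch
```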

We compared our proposed approach to the Copas,17 Frosi,18 and trim and fill22 methods. The Copas method provides a maximum bias bound, |b|, and hence bounds on the pooled effect:

$$ \hat{\theta}_p \pm |b|, \quad \text{where}\quad |b| = \frac{n+m}{n}\,\phi\!\left\{\Phi^{-1}\!\left(\frac{n}{n+m}\right)\right\}\,\frac{\sum_i \left(\sigma_i^2+\tau^2\right)^{-1/2}}{\sum_i \left(\sigma_i^2+\tau^2\right)^{-1}}, $$

where n is the total number of trials in the meta-analysis, m is the number of missing studies at high risk of ORB, σᵢ² is the within-study variance of trial i, τ² is the between-study variance, ϕ(·) is the standard normal density function, and Φ⁻¹(·) is the inverse of the standard normal cumulative distribution function.
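
A direct transcription of this bound is sketched below (Python; `scipy`'s `norm.pdf` and `norm.ppf` serve as ϕ and Φ⁻¹):

```python
import numpy as np
from scipy.stats import norm

def copas_bias_bound(sigma2, tau2, m):
    """Maximum bias bound |b| from the formula above.

    sigma2 : within-study variances of the n reported trials
    tau2   : between-study variance
    m      : number of missing outcomes at high risk of ORB
    """
    sigma2 = np.asarray(sigma2, float)
    n = len(sigma2)
    frac = n / (n + m)
    num = np.sum((sigma2 + tau2) ** -0.5)
    den = np.sum((sigma2 + tau2) ** -1.0)
    return (1.0 / frac) * norm.pdf(norm.ppf(frac)) * num / den

# The pooled estimate is shifted toward the null by at most |b|, e.g.,
# theta_adj = theta_hat + copas_bias_bound(v, tau2, m) for a pooled
# effect theta_hat < 0.
```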

For the Frosi method, we used the mvmeta command in STATA with a REML approach to adjust for unreported outcomes. We applied the trim and fill approach using STATA’s metatrim command.

Simulation Study

To test the accuracy of the various approaches, we generated data for five outcomes across 56 study arms compared with placebo using a multivariate normal distribution, with the mean and covariance of the outcomes based on values from our beta-blocker data set. We selected the five most commonly reported outcomes for the simulation (frequency, headache index, duration, severity, 50% improvement). We then randomly deleted observations for each outcome to match the number of missing outcomes in our beta-blocker set, producing a dataset of 56 intervention comparisons with varying numbers of missing observations. We generated NMAR data by randomly deleting a portion of the non-significant results (from 5 to 40%, in increments of 5%). We then tested all approaches (ours, Copas, Frosi, and trim and fill) to adjust for ORB for headache frequency, the most commonly reported outcome. Based on 1000 replications at each level of missing data, we compared the results to the unbiased results from the full simulated dataset without missing observations.
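
The core of this data-generation step can be sketched as follows (Python; the means, covariance, and standard errors below are illustrative assumptions, not the beta-blocker estimates):

```python
import numpy as np

rng = np.random.default_rng(12345)

# Illustrative parameters: five correlated outcomes across 56 study arms.
n_arms = 56
mu = np.array([-1.4, -0.8, -0.5, -0.3, 0.4])    # assumed outcome means
cov = 0.5 * np.eye(5) + 0.25                     # exchangeable covariance
effects = rng.multivariate_normal(mu, cov, size=n_arms)
se = rng.uniform(0.2, 0.6, size=(n_arms, 5))     # assumed within-study SEs

def delete_nmar(effects, se, frac, rng):
    """Delete (set to NaN) a fraction of the non-significant results,
    so that missingness depends on the values themselves (NMAR)."""
    nonsig = np.abs(effects / se) < 1.96         # not significant at p < 0.05
    drop = nonsig & (rng.random(effects.shape) < frac)
    out = effects.copy()
    out[drop] = np.nan
    return out

# Sweep the missingness rate from 5% to 40% in 5% steps.
for frac in np.arange(0.05, 0.41, 0.05):
    y_missing = delete_nmar(effects, se, frac, rng)
    # ...apply each adjustment method to y_missing and compare with the
    # pooled estimate from the complete `effects` array.
```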

Sensitivity Analysis

To test the sensitivity of the various methods of adjusting for ORB, we randomly selected a subset of our simulated studies (n = 40, 30, 20, 10, 5 studies) and calculated the unadjusted pooled effect on frequency as well as the adjusted values from the four adjustment approaches.

RESULTS

We included data from 56 study arms compared with placebo from 47 randomized controlled trials of beta-blockers to prevent episodic migraine headache. Commonly collected outcomes were headache frequency and severity. Among the 286 outcomes reported as having been collected, 142 (48%) had data provided in the manuscript, and the remaining 144 (52%) did not. Our assessment of the risk of ORB is shown in Appendix Table 2.

Test for Likelihood of Outcome Reporting Bias Being Present

Among studies reporting data, 86 outcomes were statistically significant and 55 were not. Among the 142 outcomes that were reported as collected but for which no data were provided, 8 were reported to be significant, 22 were reported to be non-significant, and 112 did not report significance (Table 1). Reported outcomes were more likely to be significant than outcomes that were collected but not reported (RR, 2.4; 95% CI, 1.9 to 2.9). This suggests that the data are not missing at random. Little’s test failed to find evidence that these data were not MCAR (p = 0.43).

Efficacy Adjusted for Missing Data

Based on reported data, there was evidence that beta-blockers were effective in preventing migraine headaches for most outcomes (Table 2). Headache frequency was reduced: − 1.2 headaches/month (95% CI, − 1.5 to − 0.8). Our approach to adjusting for ORB reduced the benefit to − 1.0 headaches/month (95% CI, − 1.3 to − 0.7). Our approach adjusted estimated effects toward the null for all outcomes analyzed, with reductions ranging from 7 to 88% of the unadjusted values (Table 2). For estimates that used SMDs, the adjustment resulted in effect sizes deemed trivial for most outcomes.29 There was no significant difference in estimates of variance between the regression and multiple imputation approaches.

Table 1 Outcomes Reported by Studies as Collected (With or Without Abstractable Data) or Uncollected
Table 2 Summary Effects, Unadjusted and Adjusted for Missing Outcomes

Our proposed approach produced values similar to those of the Copas method (Table 2). The Frosi method failed to converge for several outcomes. For headache frequency, the Frosi method produced essentially no change in the pooled estimate (− 1.1; 95% CI, − 1.4 to − 0.8). For all other outcomes, the Frosi method increased the pooled benefit by 13 to 215% (Table 2). The trim and fill method failed to identify the presence of biased data for any outcome.

Table 3 Impact of Varying Proportion of Missing Data (Simulated Data with 1000 Repetitions)

Simulation Results

In our simulated dataset, the intervention reduced the frequency of headaches (− 1.4 headaches/month; 95% CI, − 1.8 to − 1.0; 40 studies; I² = 96%); this served as our estimate of the “true” effect of the intervention. Table 3 summarizes the average unadjusted and adjusted pooled effects over 1000 replications at varying rates of missing (NMAR) data. As the percentage of missing data increased, the unadjusted pooled benefit also increased, from a reduction of 1.4 headaches per month (95% CI, 1.0–1.8) when 5% of outcomes were missing to 1.9 fewer headaches per month (95% CI, 1.4–2.3) when 40% of the data were missing. Applied to the simulated data with missing outcomes, both our method and the Copas method produced adjusted pooled estimates close to the true values. The Frosi method yielded similar results when the missing rate was at most 25%; with higher degrees of missing data, the Frosi method produced adjustments suggesting greater benefit than the true effect. The trim and fill method failed to detect any evidence of ORB in any scenario.

Sensitivity Analysis

Varying the number of included studies produced similar adjustments with both our approach and the Copas method. The Frosi method failed to converge for the 5-study data set. The trim and fill method failed to recognize the presence of ORB in any subsample (Table 4).

Table 4 Adjusted and Unadjusted Results for Frequency with Progressively Smaller Data Sets

DISCUSSION

In this cohort of randomized controlled trials evaluating how well beta-blockers prevent migraine headaches, significant outcomes were twice as likely to be reported as non-significant ones. Stratifying outcomes into contingency tables based on author reports is a simple method to assess the likelihood that ORB occurred. We found that our simple approach to adjusting for unreported outcomes gave results similar to those of the Copas method’s more complex approach. Frosi’s multivariate method overestimated benefits, suggesting that the reported outcomes on which its imputations rely were themselves systematically biased. The trim and fill method performed poorly and should not be used to account for ORB. While no adjustment caused a statistically significant result to become non-significant, many adjusted effects became clinically small or even trivial.29

In our simulation, as the unreported rate for non-significant observations increased, the pooled effect estimates became increasingly biased. Our simple method and the Copas method provided bias corrections that brought the adjusted values close to the true pooled estimate. Frosi’s multivariate method gave similar results when the unreported rate was less than 25%; when the unreported rate was higher, its adjustments produced estimates of benefit greater than the “true” values. The trim and fill method performed poorly, identifying no potentially missing studies.

Our method could allow meta-analysts to quickly and simply estimate the potential impact of ORB on treatment effects, with accuracy similar to that of prior, more complex methods. This simplicity could allow more frequent use in meta-analyses of randomized controlled trials, which in turn could have important clinical implications: it would give clinicians and policy makers a better understanding of the range of likely benefit.

Multivariate approaches to dealing with missing data are useful when most data are present and there is correlation between unreported and reported outcomes. Unfortunately, systematic reviews often address numerous outcomes, yet few studies provide results for all outcomes of interest. When many outcomes are simultaneously missing (a common occurrence), multivariate approaches face convergence problems. Our data suggest that imputing missing values from biased datasets amplifies the bias, which may explain why the multivariate (Frosi) approach yielded progressively more biased findings as the rate of unreported data increased.

There are several important limitations to our study. Our method shares a limitation with the Copas method: it does not account for correlation among the various outcomes. This study was limited to a single intervention type for a single disorder; additional evaluations of other sets of studies are necessary to test generalizability. In addition, whether our approach will work for reviews incorporating non-randomized trials needs to be investigated. It is possible that unreported outcomes showed benefit but were non-significant; if so, our adjustment would be conservative, reducing the true treatment effect more than including the unreported outcomes would have. On the other hand, if some of the missing data showed non-significant harm, our approach could underestimate the impact of missing data. By setting all missing effects to zero, we may also have underestimated the true heterogeneity. Finally, clinical trial data with unreported outcomes were not obtained to assess the performance of our proposed models. Obtaining unreported outcome data from the original study authors is the best approach for dealing with data that are not missing at random; in practice, though, this is difficult to achieve.

There are possible long-term solutions. Requiring protocol registration at inception could reduce selective reporting.30 This would require protocols to be more complete and detailed than most clinical protocol registration sites are currently able to enforce.31, 32 Another approach would be to require researchers to make all study data publicly available. Many researchers are wary of this requirement because they may have secondary analyses they hope to perform and would prefer to exhaust their data before posting it. It might be necessary to restrict other researchers’ use of the posted data to systematic reviews. At a minimum, researchers should annotate that additional data are forthcoming and should be required to “close out” the study, after exhausting their analyses, by posting all data to the registry. A final approach would be to require articles to list all outcomes collected, with an effect estimate and confidence interval for each.14

In summary, we found strong evidence that authors selectively reported outcomes in this cohort of trials of beta-blockers for migraine headache. We evaluated a simple method to explore the impact of selective reporting in randomized controlled trials and found that it performs well as long as fewer than 30% of outcomes are missing. We recommend that meta-analysts consider exploring the impact of ORB on their results as a sensitivity analysis for the possibility that the results are overstated. Several additional studies could be considered. First, it would be interesting to assess whether making a distributional assumption about missing values, centered on the null, would affect the results, though this would complicate the calculations. In addition, studies are needed to explore how well this method works for non-randomized trials and in other clinical data sets.