Background

Systematic reviews and meta-analyses [1] based on individual participant-level data (IPD) from randomized controlled trials (RCTs) are considered to provide the highest level of rigor for evaluating the evidence for a clinical question [2]. Such reviews offer the possibility of using hierarchical statistical techniques that better handle sources of heterogeneity, allow for sub-group analyses, and facilitate assessment of rare events. IPD meta-analyses have previously modified [3–8] or overturned [9] the results of earlier meta-analyses based on the published literature alone.

Efficient and unbiased mechanisms to replicate research findings are essential for maintaining high levels of scientific credibility [10]. The premise of replication efforts is that different groups, employing rigorous methods, may take different approaches and come to different conclusions on a previously addressed question. Recent efforts to promote data sharing by the National Institutes of Health [11, 12], the pharmaceutical industry [13, 14], and partnerships between academia and industry [15, 16] have made replication an increasingly available mechanism to test the validity of clinical trial conclusions. This work is particularly important for systematic reviews and meta-analyses, which frequently form the basis of professional society and government guideline recommendations [17].

Previous studies have sought to determine whether systematic reviews are replicable, with new teams performing new searches, summaries, and analyses of the literature for a particular question. These studies, which compare systematic reviews of the published literature conducted at different time points, suggested that groups investigating the same research question may differ in their findings [18–21], though most often, these differences were attributed to search strategy [6, 19, 22–25]. However, it is uncertain if replication of meta-analyses, particularly those with the same research objectives, participant-level data, time, and funding, would employ the same analytic methods or arrive at identical findings. A thorough understanding of the reliability of meta-analysis requires an empiric assessment of how two distinct teams of investigators would employ meta-analytic techniques to address the same clinical question. Accordingly, we sought to determine if two independent centers, each of which was contracted to pursue identical research questions concurrently, with access to identical IPD, would employ identical methods in the areas of data use and statistical analysis and report identical, or at least consistent, results and conclusions.

Methods

Study design

We retrospectively compared the research methods and results of the final comprehensive publications of two meta-analyses performed in the context of full systematic reviews of recombinant human bone morphogenetic protein-2 (rhBMP-2) prepared by two independent centers, Center A [26, 27] from the University of York and Center B [28, 29] from Oregon Health & Science University, and focused on (1) meta-analysis trial inclusion criteria; (2) statistical methods; (3) summary risk estimates; and (4) conclusions.

Trial inclusion criteria were defined as study characteristics necessary for inclusion in meta-analysis. We explicitly compared, for primary and secondary endpoint meta-analyses, as well as safety analyses, the trials used by both centers for each analysis. For methods, we compared centers’ reported outcomes at various time points as well as statistical methods. We compared the centers’ risk estimates for all primary outcomes for efficacy as well as safety at all time points. In consideration of these factors, we provide a subjective comparison of the overall conclusions drawn by each center.

Conducting the systematic reviews and IPD meta-analysis

Following controversy in the literature surrounding adverse events related to rhBMP-2, including cancer, Medtronic agreed in August 2011 to participate in the Yale University Open Data Access (YODA) Project model, which has been described previously (Fig. 1) [30]. Appendix 1 provides additional context on the particular clinical controversy covered by these reviews. Our analysis focuses on systematic review reproducibility rather than this particular clinical question, which has already been well described in the literature. An open request for proposal was announced by the YODA Project to solicit applications from external investigators with preliminary research aims to study the safety and efficacy of rhBMP-2. The YODA Project selected research groups from Oregon Health & Science University (OHSU) and the University of York in the UK (York). These leading centers specialize in the conduct of systematic reviews and bring internationally recognized primary investigators who have made significant contributions to methodology development for organizations including the Cochrane Collaboration and the Agency for Healthcare Research and Quality (AHRQ). Based on feedback from OHSU and York, a set of reconciled aims was developed to ensure a common scope (Table 1) [31]. Each group independently developed its protocol for conducting the systematic review and deposited the full protocol with the YODA Project. Both groups registered short versions of their protocols, without detailed methods for analysis, on the PROSPERO registry of systematic reviews on February 23, 2012 (CRD42012002040 and CRD42012001907).

The YODA Project transferred the full set of Medtronic data relating to rhBMP-2 to the centers in early December 2011. This included full de-identified individual participant-level data for 17 trials, consisting of 8 pilot studies, 8 pivotal RCTs, and 1 study terminated for commercial reasons. The total number of participants was 2091, consisting of 1077 rhBMP-2 recipients and 1014 control participants. Also included were protocols, data dictionaries, internal reports consisting of summaries of study data, and brief adverse event case histories. In addition, 1229 MedWatch adverse event reports submitted to the US Food and Drug Administration between July 2003 and July 2012 were provided.

Each center completed IPD meta-analyses on the effectiveness and harms of rhBMP-2 in the context of full systematic reviews. Each site was responsible for determining the appropriateness of conducting a systematic review as well as its methods and research questions within the scope of the specified research aims. The project was designed so the review groups would work in parallel and have no mutual communication about their approaches. Questions from the groups were communicated through the YODA Project review coordinator so that there was no direct communication between the groups and Medtronic.

Draft reports of comprehensive findings were received from both groups by the YODA Project in mid-August 2012. These reports were peer-reviewed by separate review teams consisting of members of the YODA Project and steering committee, which included clinical, statistical, and methodological experts, as well as by a representative from Medtronic. A peer reviewer had access to only one of the two reports at any time before final publication, and there was no communication between the separate review teams. Comments were returned to the research groups in September 2012. The groups prepared separate manuscripts for submission for publication in the Annals of Internal Medicine. Final reports of comprehensive findings, which reflected peer review comments from the journal and from the YODA Project, were received in summer 2013. These comprehensive reports, which we review in this paper, were published on the YODA Project website concurrently with the articles in the Annals on June 18, 2013. The data set has subsequently been made available to additional researchers through a request process [32]. The Human Investigation Committee at Yale University determined that this study is not considered to be Human Subjects Research and did not require further review.

Fig. 1

YODA Project timeline for the independent synthesis and meta-analysis of rhBMP-2 clinical trials, including their publication

Table 1 Explicit research aims provided to the 2 independent centers by the YODA Project

Results

Meta-analysis inclusion criteria

Trial inclusion was largely similar, the primary difference being IPD obtained from a single published RCT. Both centers chose to include only RCTs of rhBMP-2 in spinal fusion in their meta-analyses, and both groups analyzed 11 of the RCTs. Center A obtained IPD from an additional non-industry-sponsored RCT by Glassman et al. [33] and included it in its analysis of effectiveness, but excluded it from the analysis of harms because events were reported differently and without information on when they occurred. Though Center B identified this study, it did not solicit IPD from its authors and was able to include it only in a qualitative analysis.

Research methodology

Research methodology differed primarily in the choice of stratification, with minor differences in the choice of statistical methods. For analyses of benefits, Center A included trials that compared rhBMP-2 with standard bone grafting techniques across all surgical approaches. As the primary analysis, Center A performed a standard two-stage meta-analysis along with a sub-group analysis that did not find evidence of differences between surgical approaches.
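As a minimal sketch of what a two-stage meta-analysis involves, the Python example below first reduces each trial to a log risk ratio with its variance (stage 1) and then pools those estimates with inverse-variance weights (stage 2). The trial counts are hypothetical, and the DerSimonian-Laird random-effects estimator is an assumption chosen for illustration; it is not taken from Center A's report and should not be read as that center's actual model.

```python
import math


def log_risk_ratio(events_t, n_t, events_c, n_c):
    """Stage 1: per-trial log risk ratio and its approximate variance.

    Assumes every arm has at least one event (no continuity correction).
    """
    log_rr = math.log((events_t / n_t) / (events_c / n_c))
    var = 1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c
    return log_rr, var


def pool_random_effects(estimates):
    """Stage 2: DerSimonian-Laird pooling of (effect, variance) pairs."""
    weights = [1 / v for _, v in estimates]
    total_w = sum(weights)
    fixed = sum(w * y for w, (y, _) in zip(weights, estimates)) / total_w
    # Cochran's Q measures between-trial heterogeneity.
    q = sum(w * (y - fixed) ** 2 for w, (y, _) in zip(weights, estimates))
    df = len(estimates) - 1
    c = total_w - sum(w * w for w in weights) / total_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    re_weights = [1 / (v + tau2) for _, v in estimates]
    pooled = sum(w * y for w, (y, _) in zip(re_weights, estimates)) / sum(re_weights)
    se = math.sqrt(1 / sum(re_weights))
    return pooled, se, tau2


# Hypothetical trials, NOT data from the rhBMP-2 reviews:
# (events in treatment arm, n treatment, events in control arm, n control)
hypothetical_trials = [(85, 100, 78, 100), (130, 150, 118, 150), (44, 50, 40, 50)]

stage1 = [log_risk_ratio(*trial) for trial in hypothetical_trials]
pooled, se, tau2 = pool_random_effects(stage1)
rr = math.exp(pooled)
ci = (math.exp(pooled - 1.96 * se), math.exp(pooled + 1.96 * se))
print(f"pooled RR = {rr:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f}), tau^2 = {tau2:.4f}")
```

A one-stage analysis would instead fit a single regression model to all participant-level records at once, which generally requires a mixed-effects modelling library rather than the closed-form pooling shown here.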

Center B stratified by surgical approach for effectiveness and most harms and determined that only two of the four surgical approaches (anterior lumbar interbody fusion (ALIF) and posterolateral fusion (PLF)), which were studied in multiple RCTs, provided adequate data for meta-analysis. Center B employed a one-stage meta-analysis, using mixed effects regression models. The study comparing rhBMP-2 with lumbar disc prosthesis was included in the analysis of cancer and death, which was not stratified by surgical approach.

Both centers studied the same primary outcomes for effectiveness and reported them at the same time points of 6 weeks and 3, 6, 12, and 24 months after surgery (Table 2).

Similar outcomes were also reported between the centers for harms up to 4 weeks and then up to 24 months for general adverse events, and up to 48 months for cancer and death.

Neither group found evidence of an rhBMP-2 dose-response relationship or heterogeneity in groups that received high-dose forms of rhBMP-2, so all dose formulations were combined.

For harms, Center A chose to combine all trials using a generalized mixed effects model since specific adverse events were few at the trial level. Center B also used a generalized mixed effects model with stratification by surgical approach, except for cancer and death.

Table 2 Results from meta-analyses conducted independently by Centers A and B examining measures of efficacy and safety associated with rhBMP-2

Summary results estimates

The groups obtained similar results in summary estimates of most clinical outcomes and adverse events, although there were notable differences. Center A found a statistically significant increase in fusion rate at 24 months (12% over controls) combining data across all surgical approaches. In contrast, Center B, reporting results for each surgical approach separately, did not find a significant increase in fusion at 24 months. For reducing back pain and overall disability, Center A found a statistically significant advantage for all time points from 6 months onwards when combining data from all approaches, with no statistically significant difference in the effectiveness of rhBMP-2 by surgical approach (Fig. 2). For Center B, differences for pain reduction were statistically significant from 3 months onward for ALIF, but only at the 6-month time point for PLF.

Findings were not identical for cancer; Center B reported a statistically significant increased risk at 24 months with the use of rhBMP-2, and Center A did not report at 24 months. Neither group found a significantly increased risk of cancer associated with rhBMP-2 at 48 months (Fig. 2). Both groups reported similar but not identical findings for the frequency of regular adverse events.

Fig. 2

Forest plots from Center A and Center B meta-analyses examining likelihood of bone fusion and cancer risk associated with rhBMP-2

Summary conclusions

Center A interpreted benefits to fusion and postoperative pain as “clinically insignificant” and increased cancer incidence as “inconclusive,” noting that “whether this increased risk is genuine is uncertain” (Table 3). Overall, by this analysis alone, rhBMP-2 seemed to offer improved rates of fusion with similar clinical outcomes compared with standard techniques at the expense of increased reports of back and leg pain in the early postoperative period.

In contrast to Center A’s report, Center B found “moderate-strength evidence of no consistent differences between rhBMP-2 and ICBG in…fusion rates.” In addition, it reported a statistically significant increase in cancer at the 24-month time point, while noting that “This finding should be interpreted with caution because cases were heterogeneous.” Overall conclusions from Center B seemed to indicate more strongly than those of Center A that rhBMP-2 had no additional clinical benefit. Center B reported that its “analysis underscores that more definitive evidence about harms was needed before rhBMP-2 became widely used” and that “On the basis of the currently available evidence, it is difficult to identify clear indications for rhBMP-2 in spinal fusion. This analysis shows almost no clinical benefit for the product while raising questions about the potential risk for cancer.”

Table 3 Conclusions made by Centers A and B after conducting independent summary reviews and meta-analysis of rhBMP-2 trials

Discussion

In our study of two independent centers provided with identical objectives, data, resources, and time to conduct concurrent meta-analyses, we found that the centers did not report identical methods, results, and interpretations. In addition, the potential benefit of additional analyses of the same data was not limited solely to increasing confidence through replication. Separate analyses revealed nuanced differences, with potential implications for clinical management, that could be produced from the same data set using valid methods. These findings, even though largely similar, support the case for greater sharing of and access to clinical data as a way to maximize public dialogue about the meaning of the data and to ensure that a single interpretation does not lead people to believe there is no other possible approach.

The centers took different but methodologically defensible approaches in their attempts to best represent the results of this data set in a relevant and valid way. Review methods differed in data stratification and in the use of IPD obtained from an additional trial. One group chose to combine data across all surgical approaches, finding little heterogeneity in trials by approach. The other group chose to stratify and analyze by surgical approach, forgoing increased statistical power in recognition of the real differences and adverse event concerns between different surgical approaches, and to present in a format perhaps more intuitive to spine surgeons. Study inclusion diverged, with one group obtaining IPD from an additional trial not funded by Medtronic and including it in the analysis of effectiveness. Even with the proliferation of standards in methodology, this demonstrates that we can expect some differences in how two similarly qualified groups might choose to conduct a complex systematic review. This diversity in methods has the potential to add to the depth of our understanding of a product and reinforces that additional value can be tapped from a data set with open access.
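The trade-off between pooling and stratifying can be illustrated with a small numerical sketch. The per-trial effects and variances below are hypothetical, chosen only to show how the same trials can yield a statistically significant combined estimate yet non-significant stratum-level estimates. Simple fixed-effect inverse-variance pooling is assumed here for brevity; neither center's actual model is reproduced.

```python
import math


def pool_fixed(estimates):
    """Inverse-variance fixed-effect pooling of (effect, variance) pairs."""
    weights = [1 / v for _, v in estimates]
    pooled = sum(w * y for w, (y, _) in zip(weights, estimates)) / sum(weights)
    return pooled, math.sqrt(1 / sum(weights))


def is_significant(effect, se, z=1.96):
    """True when the 95% confidence interval excludes zero (no effect)."""
    return abs(effect) > z * se


# Hypothetical log risk ratios and variances grouped by surgical approach.
# These numbers are illustrative assumptions, not results from either review.
hypothetical = {
    "ALIF": [(0.10, 0.006), (0.08, 0.008)],
    "PLF": [(0.09, 0.007), (0.11, 0.009)],
}

# Stratified analysis: pool within each surgical approach separately.
stratified = {name: pool_fixed(trials) for name, trials in hypothetical.items()}
for name, (effect, se) in stratified.items():
    print(f"{name}: effect={effect:.3f}, se={se:.3f}, "
          f"significant={is_significant(effect, se)}")

# Combined analysis: pool every trial regardless of approach.
all_trials = [t for trials in hypothetical.values() for t in trials]
effect_all, se_all = pool_fixed(all_trials)
print(f"combined: effect={effect_all:.3f}, se={se_all:.3f}, "
      f"significant={is_significant(effect_all, se_all)}")
```

Because the combined analysis draws on more trials, its standard error is smaller than that of either stratum. A modest but consistent effect can therefore cross the significance threshold only when the strata are pooled.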

These differences in approach led to differences in summary estimates. For the outcome of spinal fusion, one group found a statistically significant difference even as it discounted the clinical importance. Nevertheless, this finding could support the argument in the spine literature that the use of this product is warranted in certain indications and select cases where the risk of non-union is great and its consequences potentially disastrous [34, 35]. In contrast, this difference was no longer detectable when data were stratified by surgical approach in the other review, and surgeons looking at these data alone might see fewer instances where this product would be beneficial. For estimates of cancer, there were slight differences in the time points reported. Center B showed a statistically significant increase in cancer at the 24-month time point but concurred that cancer was not significantly increased for longer follow-up. In both cases, the absolute risk of cancer was low, and the cancer types represented were heterogeneous.

In contrast to previous studies of concurrent meta-analyses in nutrition and endometrial cancer [7] and immunotherapy treatments for spontaneous abortions, this study found that concurrently conducted meta-analyses examining the same data arrived at conclusions that readers may or may not interpret similarly. Currently, it appears that nearly all meta-analyses are conducted by single groups, without replication. Additional analyses necessitate the sharing of data, and this in itself can bring important benefits. Information in the published literature is often incomplete. Data sharing has the potential to allow for a more complete picture of the benefits and harms of a treatment based on the totality of available evidence. Data are often collected or subsidized at the public expense and need to be made more widely available for the public benefit. Across a diverse array of fields, open access to data and the potential for reanalysis can, at the minimum, strengthen confidence in the findings of a systematic review while offering the potential to add to or even alter the conclusions about an intervention.

While there are many benefits and arguments for greater data sharing, these benefits must be considered in light of potential downsides that might come with additional analyses based on the same data. In this project, we addressed industry concerns around spurious analysis and litigation, as well as biased and methodologically flawed studies which might unfairly taint a product. Academia too faces challenges around credit, bias, and the potential for conflicting messages to confound decision-making. Ultimately, we believe a process of frameworks, like the YODA Project, and norms can help manage these potential problems and unlock the benefits that come with greater sharing of data.

The generalizability of our findings to other settings is not known. However, the design of our approach should have made it more likely that the results would be the same rather than different. The two groups were provided with the same data from all manufacturer-sponsored studies which, for rhBMP-2, represented the vast majority of high-quality studies on this product. Studies of other questions could be limited by differences in search strategies and disagreement over key studies. The groups also received identical funding from an outside organization, and neither the groups nor the funders had any financial interest in this product.

Conclusions

Two independent and expert review groups that performed independent meta-analyses of rhBMP-2 came to broadly similar findings, though with some differences on the statistical significance of primary analyses of fusion and cancer. The clinical importance of the differences may be debatable, and even the authors of this article differed in their interpretations of the results and conclusions presented in these analyses. What is certain is that the methods and interpretations were not identical and had different points of emphasis. This underscores the importance of making data more openly available for the purpose of additional scientific inquiry to maximize the knowledge that can be extracted.