Introduction

Since 1992, the Outcome Measures in Rheumatology (OMERACT) consensus initiative has successfully developed core (or minimum) sets for many rheumatologic conditions [1]. A “core outcome set” (COS) specifies which outcome domains (i.e., constructs or concepts [what to measure]) and outcome measurements (i.e., how to measure) to apply in randomized controlled trials (RCTs) [2].

The Assessment of SpondyloArthritis international Society (ASAS) has aimed to bring evidence-based unity to the multitude of assessments in the field of axial spondyloarthritis (axSpA). Currently, ASAS’s scope includes the entire spectrum of spondyloarthritis (SpA) [3]. axSpA comprises two subcategories based on the presence of structural changes in the sacroiliac joints: radiographic (r-) axSpA, implying the fulfillment of the modified New York criteria, and non-radiographic (nr-) axSpA.

ASAS has selected a set of core outcome domains to include among a set of standardized measures in clinical trials, which is defined by the following scenarios: (i) symptom-modifying anti-rheumatic drugs (SM-ARD)/physiotherapy, (ii) clinical record keeping for studies, and (iii) disease-controlling anti-rheumatic therapy (DC-ART) (Fig. 1). The selected domains to include as standardized outcomes in RCTs for all three scenarios include the following: “physical function,” “pain,” “spinal mobility,” “spinal stiffness,” and “patient’s global assessment” (PGA). The core set for clinical record keeping further includes the domains “peripheral joints/entheses” and “acute phase reactants,” and the core set for DC-ART further includes the domains “fatigue” and “spine and hip radiographs” [4]. ASAS core outcome domain sets were endorsed by OMERACT in 1998 [5].

Fig. 1

ASAS/OMERACT core domains for axSpA. Inner circle, core domains for SM-ARD/physical therapy; two inner circles, core domains for clinical record keeping; all three circles, core domains for DC-ART. SM-ARD, symptom-modifying anti-rheumatic drug; DC-ART, disease-controlling anti-rheumatic treatment

Although composite outcomes seem an attractive method to increase statistical power (e.g., BASDAI 50 response), they can mask the effect (or lack of effect) of treatment on the individual domains. This study therefore sets out to assess the effect of interventions for axSpA according to each core domain in the existing COS, as well as its association with the primary statistical outcome in the individual trials.

Main text

Materials and methods

We conducted a meta-epidemiological study by evaluating axSpA trials included in Cochrane reviews (i.e., reviews from the Cochrane Musculoskeletal Review Group). Study selection, assessment of eligibility criteria, data extraction, and statistical analyses were performed based on a pre-specified protocol. In accordance with current methodology, the protocol is available (Supplement A) and registered on PROSPERO (CRD42018091257). The study conforms to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [6, 7].

Literature search

A systematic search was done on May 1, 2018, to identify all Cochrane reviews that reported interventions for the management of axSpA. Two reviewers (RAA and RC) searched directly in the Cochrane Database of Systematic Reviews, where eligible trials were identified from published Cochrane reviews (i.e., meta-analyses) after a thorough search, using the following terms: (ankylosing spondylitis OR bechterew disease OR ankylosing spondylarthritides OR axial spondyloarthritis OR axial spondyloarthritides). The most recent version of the Cochrane review was used. Unlike what was pre-specified in PROSPERO, for feasibility, we used the Cochrane Database of Systematic Reviews directly rather than PubMed, since this meta-research study’s eligibility criteria state that only trials included in Cochrane reviews would be considered for eligibility.

Eligibility criteria

Cochrane reviews that incorporated RCTs in patients with axSpA were included in our study. Only reviews with superiority trials were considered eligible. All reports for each RCT included in eligible reviews were obtained for evaluation. Non-RCTs and trials without full publications were excluded.

Risk of bias in individual studies (internal validity)

The risk of bias (RoB) within each study was assessed using the domains of the RoB tool, as recommended by the Cochrane Collaboration [8]. The bias domains included selection bias (methods for sequence generation and allocation concealment), performance bias (blinding of participants and personnel), attrition bias (incomplete outcome data), and reporting bias (selective outcome reporting). Each domain was rated as adequate, inadequate, or unclear risk of bias [9]. RAA completed all the RoB assessments and applied the RoB that was included and reported in the original Cochrane reviews as a proxy for a second reviewer assessment.

Data extraction strategy

At trial level, we extracted the first author, publication year, study duration, type of intervention, and total number of patients randomized.

The domains that were collected included the following: (i) physical function, (ii) pain, (iii) spinal mobility, (iv) spinal stiffness, (v) fatigue, (vi) patient’s global assessment, (vii) peripheral joints/entheses, (viii) acute phase reactants, and (ix) spine and hip radiographs. Furthermore, at the individual trial level, we extracted data on how many participants achieved the stated primary outcome.

If data from more than one instrument were provided for any domain, we extracted data on the instrument highest on the list proposed by ASAS/OMERACT [3, 4] (Supplement B).

Trials with multiple intervention arms were split into separate comparisons, referred to as “randomized comparisons” (e.g., a three-arm trial with two active interventions generated two randomized comparisons with placebo). To avoid double counting of patients, the number of patients in the placebo group was divided by the number of active treatment arms, which adjusts the standard errors accordingly [10].
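As an illustration only, the arm-splitting rule described above could be sketched in Python as follows. The `Arm` class, the function name, and the trial numbers are hypothetical and are not taken from any included trial.

```python
from dataclasses import dataclass


@dataclass
class Arm:
    name: str
    n: int  # number of patients randomized to this arm


def randomized_comparisons(control, active_arms):
    """Return (active arm name, active n, shared control n) per comparison.

    The control group size is divided evenly among the active arms so that
    no control patient is counted more than once across comparisons.
    """
    if not active_arms:
        raise ValueError("at least one active arm is required")
    shared_control_n = control.n / len(active_arms)
    return [(arm.name, arm.n, shared_control_n) for arm in active_arms]


# Hypothetical three-arm trial: two active doses versus placebo.
comparisons = randomized_comparisons(
    Arm("placebo", 100),
    [Arm("dose A", 98), Arm("dose B", 102)],
)
for name, n_active, n_control in comparisons:
    print(f"{name} vs placebo: n_active={n_active}, n_control={n_control}")
```

Each comparison then enters the meta-analysis with the reduced control sample size, which widens its standard error relative to using the full placebo group twice.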

Statistical analysis

Treatment effect sizes for all domains were expressed as standardized mean differences (SMDs) [11]. Standard pairwise meta-analyses for the nine domains’ SMDs with the corresponding 95% confidence interval (CI) were performed with Review Manager (version 5.3). Negative SMD values indicated a beneficial effect of the experimental intervention (e.g., pain reduction) compared with the control comparator; for ease of interpretation, we used the following “rule of thumb”: an absolute SMD of 0.2 represents a small effect, 0.5 a moderate effect, and 0.8 a large effect [12].
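For readers less familiar with the SMD, the following minimal Python sketch shows one common way to compute it (Hedges' adjusted g, with a widely used large-sample approximation for its standard error) and to apply the rule of thumb to its absolute value. The function names, the "negligible" label for values below 0.2, the inclusive thresholds, and the example numbers are assumptions made for illustration; the formulas used internally by Review Manager may differ in detail.

```python
import math


def hedges_g(mean_exp, sd_exp, n_exp, mean_ctl, sd_ctl, n_ctl):
    """Standardized mean difference with Hedges' small-sample correction.

    Returns (g, standard error). A negative g favors the experimental arm
    when lower scores are better (e.g., pain).
    """
    df = n_exp + n_ctl - 2
    pooled_sd = math.sqrt(
        ((n_exp - 1) * sd_exp**2 + (n_ctl - 1) * sd_ctl**2) / df
    )
    d = (mean_exp - mean_ctl) / pooled_sd
    g = (1 - 3 / (4 * df - 1)) * d  # small-sample correction factor
    # Large-sample approximation of the standard error (an assumption here).
    se = math.sqrt(
        (n_exp + n_ctl) / (n_exp * n_ctl) + g**2 / (2 * (n_exp + n_ctl))
    )
    return g, se


def effect_label(smd):
    """Classify |SMD| with the 0.2 / 0.5 / 0.8 thresholds quoted in the text."""
    size = abs(smd)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "moderate"
    if size >= 0.2:
        return "small"
    return "negligible"  # label for values below 0.2 is an assumption


# Hypothetical change-from-baseline pain scores (lower is better).
g, se = hedges_g(-2.0, 2.0, 50, -1.0, 2.0, 50)
print(f"g = {g:.3f}, SE = {se:.3f}, effect: {effect_label(g)}")
```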

We used standard random-effects meta-analysis as the default option, whereas fixed-effect analysis was applied as a sensitivity analysis [13]. We used the chi-squared test (Cochran’s Q test) to assess heterogeneity and the I2 statistic to assess inconsistency [8, 13]. Anticipating substantial heterogeneity, we planned a pre-specified number of stratified and meta-regression analyses. We conducted the following stratified analyses to examine the influence of different subgroups (pharmacological vs. non-pharmacological treatment, and biologic vs. other treatment) on the effect of the interventions for all the core outcomes. Only covariates that reduced the variation seen in the estimates across strata (i.e., a decrease in the between-study variance τ2, estimated as T2) were considered potentially relevant. In trials where the primary outcome was a composite outcome, meta-regression was performed to investigate which of the nine core domains (via the available SMDs) were best associated with the primary composite endpoint of the individual trials (log [ORi]). Meta-regression was performed in a stepwise manner with the following three steps:

  1. Each of the core domains was analyzed as the only independent variable in a univariate meta-regression analysis concerning the effect of the domains on the odds ratio (OR) for achieving the primary endpoint. Arbitrarily, it was decided that variables with a P value > 0.05 were excluded as potential explanatory variables in step 3. The analyses were based on all trials reporting the primary endpoint.

  2. The univariate meta-regression analysis mentioned above was repeated, but only trials reporting all the domains affecting the log [ORi] (P < 0.05) were included in the analysis.

  3. The explanatory core domains from step 1 (P < 0.05) were analyzed as the independent variables in a multivariate meta-regression analysis.

These meta-regression analyses enabled us to explore which core domains were best reflected in the composite endpoint of axSpA and what is lost when we neglect core domains by using only one composite outcome endpoint.
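The univariate screening in step 1 can be illustrated with a simplified sketch. It fits an inverse-variance weighted regression of log(OR) on a single domain's SMD and retains domains with P < 0.05. This is not the software or model used in the study: the weights ignore between-trial variance (a random-effects meta-regression would add an estimate of τ2 to each variance), and all data and names below are hypothetical.

```python
import math


def weighted_univariate_regression(x, y, variances):
    """Regress y (log OR) on x (domain SMD) with inverse-variance weights.

    Returns (slope, standard error of slope, two-sided P value from a z-test).
    """
    weights = [1 / v for v in variances]
    total = sum(weights)
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / total
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / total
    sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    sxy = sum(
        w * (xi - x_bar) * (yi - y_bar) for w, xi, yi in zip(weights, x, y)
    )
    slope = sxy / sxx
    se_slope = math.sqrt(1 / sxx)  # variances are treated as known
    z = slope / se_slope
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal P value
    return slope, se_slope, p_value


# Hypothetical data for five randomized comparisons.
log_or = [0.5, 0.9, 1.2, 1.7, 2.0]
var_log_or = [0.04] * 5
domain_smds = {
    "pain": [-0.2, -0.4, -0.6, -0.8, -1.0],
    "spinal mobility": [-0.3, -0.1, -0.5, -0.2, -0.4],
}

# Step 1: screen each domain on its own; keep those with P < 0.05 for step 3.
retained = []
for domain, smds in domain_smds.items():
    slope, se, p = weighted_univariate_regression(smds, log_or, var_log_or)
    print(f"{domain}: slope={slope:.2f}, SE={se:.2f}, P={p:.3g}")
    if p < 0.05:
        retained.append(domain)
```

A negative slope means that larger improvements in the domain (more negative SMDs) go together with higher odds of achieving the primary endpoint.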

Patient perspective

As part of the author team, MdW—an experienced patient research partner (PRP)—was consulted to review and elaborate on the protocol and confirmed the importance of the study from the patient’s perspective. MdW was involved throughout the research process as a scientific collaborator and voluntarily participated in the process of designing and preparing the study protocol and in interpreting results. Where feasible, we followed the EULAR recommendations for PRPs [14].

Results

Study selection

The search was carried out directly in the Cochrane Library on May 01, 2018. As illustrated in Fig. 2, the inclusion criteria identified twelve Cochrane reviews; one review was excluded based on the title and abstract [15]. Eleven reviews were thus retrieved for full-text examination [16,17,18,19,20,21,22,23,24,25,26]. After full-text examination, we excluded another six reviews (three reviews did not include axSpA trials [21,22,23]; two reviews were protocols only [25, 26]; one review did not report results for axSpA separately [24]). A total of five Cochrane reviews [16,17,18,19,20] with 85 possible trials were identified for inclusion. We excluded 35 studies—3 were not RCTs and 32 were not superiority trials—thus, 50 trials were found eligible for the qualitative synthesis (for reference list of included studies, see Supplement B). Of these 50 trials, 7 were not included in the quantitative synthesis: 6 trials reported most of the data as graphs, and data were not extracted [27,28,29,30,31,32], and one trial was excluded due to language restriction (Chinese [33]). The 43 included RCTs comprised a total of 63 comparisons. The interventions were categorized into three treatment groups: non-pharmacological (NP) modalities, pharmacological (P) modalities, and biological (B) modalities.

Fig. 2

Flow chart. M0, identified Cochrane reviews; M1, possible eligible reviews; M2, included reviews; K, trials from included Cochrane reviews, k*, trials included in the evidence synthesis

Study characteristics

The characteristics of the eligible trials are summarized in Table 1.

Table 1 Study characteristics and risk of bias assessment of included studies

Twenty-two trials (42%) used an adequate concealment of allocation and sequence generation (selection). Twenty-seven trials (54%) were judged to have adequate blinding of participants and caregivers (performance), and 34 trials (68%) adequately addressed incomplete outcome data (attrition). Eight trials [29, 34,35,36,37,38,39,40] (16%) were unable to provide the data of all the pre-specified outcomes, and we judged them at high risk of selective outcome reporting bias.

Characteristics of the core outcome measurement set

The outcome matrix (Table 2) shows which core domains were measured for each trial and by which measurement instrument, differentiating between those which were fully and partially reported. Overall outcome reporting was good for SM-ARD/physical therapy trials; the mean (SD) number of ASAS/OMERACT core outcome domains measured in these trials was 4.2 (1.7), and six trials assessed all five proposed domains. The mean (SD) number of ASAS/OMERACT core outcome domains measured for DC-ART trials was 5.3 (1.8). No DC-ART trial assessed all nine domains. The most commonly measured domain was spinal mobility (88%), followed by pain (86%). Most studies also included measures of physical function (78%), spinal stiffness (76%), acute phase reactants (70%), and patient’s global assessment (62%). The instruments used to measure the domains varied widely across trials. Fatigue was reported separately in only seven trials (14%). Spine radiographs were also poorly represented (2%). None of the trials reported hip radiographs.

Table 2 Outcome matrix

Physical function

All meta-analyses are shown in Supplement B.

A total of 33 RCTs (43 comparisons, 4819 participants) were included in the meta-analysis. As presented in Table 3, the overall analysis of change in physical function (PF) showed an SMD of − 0.50 (95% CI, − 0.61 to − 0.40), indicating moderate effect in favor of participants receiving intervention compared to participants receiving control. A high between-study heterogeneity was observed, τ2 = 0.07, with substantial inconsistency across studies (I2 = 64%). However, the fixed-effect analysis was in agreement with the random-effect model, resulting in a pooled SMD of − 0.53 (− 0.60 to − 0.47). The stratified meta-analyses for PF did not result in a significant reduction of τ2; type of intervention did not seem to be an important factor to the inconsistency observed across axSpA trials, when measuring change in PF.
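To make the quantities reported in this and the following sections concrete, the sketch below shows how a fixed-effect estimate, Cochran's Q, I2, τ2, and a random-effects estimate can be obtained from trial-level SMDs and standard errors. It uses the DerSimonian-Laird moment estimator of τ2, which is assumed here for illustration; the input values are hypothetical and do not reproduce the physical function analysis.

```python
import math


def pool_smds(effects, std_errors):
    """Inverse-variance pooling with a DerSimonian-Laird estimate of tau^2."""
    if len(effects) < 2 or len(effects) != len(std_errors):
        raise ValueError("need at least two effects with matching standard errors")
    weights = [1 / se**2 for se in std_errors]
    sum_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    scale = sum_w - sum(w**2 for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / scale)
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    # Random-effects weights add tau^2 to each within-trial variance.
    re_weights = [1 / (se**2 + tau2) for se in std_errors]
    random = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    se_random = math.sqrt(1 / sum(re_weights))
    return {
        "fixed": fixed,
        "random": random,
        "se_random": se_random,
        "q": q,
        "tau2": tau2,
        "i2": i2,
    }


# Hypothetical SMDs and standard errors from three comparisons.
result = pool_smds([-0.2, -0.5, -0.8], [0.1, 0.1, 0.1])
low = result["random"] - 1.96 * result["se_random"]
high = result["random"] + 1.96 * result["se_random"]
print(f"random-effects SMD = {result['random']:.2f} ({low:.2f} to {high:.2f})")
print(f"Q = {result['q']:.1f}, I2 = {result['i2']:.0f}%, tau2 = {result['tau2']:.2f}")
```

When τ2 is greater than zero, the random-effects interval is wider than the fixed-effect interval, which is why agreement between the two point estimates is reported as a sensitivity check.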

Table 3 Results of the stratified meta-analyses

Pain

In total, thirty trials (41 intervention comparisons, 4877 participants) were included in the analysis. Pooled analysis revealed a statistically significant reduction in pain with an overall SMD of − 0.48 (− 0.66 to − 0.30), indicating a moderate effect across all interventions in axSpA trials. Between-study inconsistency was substantial (I2 = 62%). A large reduction in heterogeneity was found for the “type of intervention” variable (i.e., non-pharmacological [NP] vs. pharmacological [P]), with a reduction in τ2 of 32%, supported by a statistically highly significant P value (P < 0.001) for the interaction between NP and P. Trials with pharmacological interventions had a pooled SMD of − 0.64 (− 0.78 to − 0.49), whereas trials with NP interventions had an overall SMD of 0.26 (− 0.20 to 0.72).
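As a rough plausibility check, the subgroup difference can be approximated from the two pooled estimates quoted above. The sketch below recovers each standard error from its 95% CI and applies a z-test to the difference between subgroups. This is an approximation based on published summary figures, not the test actually run in the study, and the function names are hypothetical.

```python
import math

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal distribution


def se_from_ci(lower, upper):
    """Back-calculate a standard error from a symmetric 95% CI."""
    return (upper - lower) / (2 * Z_95)


def subgroup_interaction(est_a, ci_a, est_b, ci_b):
    """z-test for the difference between two independent pooled estimates.

    Returns (difference, z statistic, two-sided P value).
    """
    se_a = se_from_ci(*ci_a)
    se_b = se_from_ci(*ci_b)
    diff = est_a - est_b
    z = diff / math.sqrt(se_a**2 + se_b**2)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return diff, z, p_value


# Pooled pain SMDs quoted in the text: NP 0.26 (-0.20 to 0.72),
# P -0.64 (-0.78 to -0.49).
diff, z, p = subgroup_interaction(0.26, (-0.20, 0.72), -0.64, (-0.78, -0.49))
print(f"difference = {diff:.2f}, z = {z:.2f}, P = {p:.2g}")
```

Under these assumptions the P value falls below 0.001, in line with the interaction reported for NP versus P.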

Spinal mobility

Forty-three trials (45 comparisons, 5091 participants) were included in our meta-analysis. Pooled analysis revealed a small improvement in spinal mobility (SM) with an overall SMD of − 0.32 (− 0.48 to − 0.17). A high between-study heterogeneity was observed, τ2 = 0.21, with a large inconsistency across studies (I2 = 83%). None of the subgroup analyses resulted in a significant reduction in τ2.

Spinal stiffness

In total, 25 trials (34 comparisons, 3658 participants) were included in the analysis. The overall effect size revealed a statistically significant improvement in spinal stiffness (SS), SMD of − 0.59 (− 0.74 to − 0.44), indicating a moderate effect of all interventions in axSpA trials. The heterogeneity was large, τ2 = 0.14, with substantial inconsistency between studies (I2 = 75%). The pre-specified stratified analyses did not result in a significant reduction of τ2, and type of intervention did not seem to be an important factor for the inconsistency observed across axSpA trials when measuring change in SS.

Patient’s global assessment

Twenty-one RCTs reported sufficient data to be included in the meta-analysis (28 comparisons, 4031 participants). A statistically significant pooled moderate effect favoring intervention was observed, with large inconsistency: SMD − 0.71 (− 0.89 to − 0.54) and I2 = 83%. Stratified analyses did not result in a significant reduction of τ2; the type of intervention did not seem to be an important factor for the inconsistency observed across axSpA trials when measuring change in PGA.

Peripheral joints and enthesitis index

Fifteen trials (2334 participants) were included in our meta-analysis. A high between-study heterogeneity was observed, τ2 = 0.05, with a substantial inconsistency across studies (I2 = 68%) [13]. There was no significant difference in the joint count/enthesitis index after the interventions; the overall SMD was 0.05 (− 0.11 to 0.22). A large reduction in heterogeneity was found in the “type of intervention variables” (i.e., NP vs. P treatments and B vs. other treatment [O]), which in turn resulted in significant reductions in τ2, supported by statistically significant P values for interactions between NP/P and B/O (P < 0.001 and P = 0.03, respectively). Treatment with a biological agent had a small effect on reducing the number of swollen joints in axSpA patients, SMD − 0.10 (− 0.20 to − 0.01).

Acute phase reactants

Twenty-seven trials (31 comparisons, 3869 participants) were included in our meta-analysis. The overall analysis of change in acute phase reactants (APRs) showed a moderate overall effect for all interventions in axSpA trials, SMD = − 0.51 (− 0.70 to − 0.32), with a large inconsistency (I2 = 84%). A large reduction in heterogeneity was found for the “type of intervention” variable (i.e., B vs. O treatments), with a reduction in τ2 of 45%, supported by a statistically significant P value (P = 0.001) for the interaction between B and O treatments. Trials with a biological intervention had a large effect on reducing APRs, SMD of − 0.77 (− 1.02 to − 0.52), whereas trials with other treatment interventions (i.e., non-steroidal anti-inflammatory drugs [NSAIDs], methotrexate [MTX], sulfasalazine [SSZ], and NP) had an overall small effect, SMD = − 0.22 (− 0.37 to − 0.07).

Spine/hip radiographs

Only two of the included trials reported a change in spine radiographs (SR). One of these reported insufficient data to be included in the meta-analysis, leaving a single trial with 32 axSpA patients in the analysis. The effect size was 0.96 (0.22 to 1.6), indicating that SSZ did not prevent spinal progression. No trial reported hip radiographs.

Fatigue

Three studies reported sufficient data to be included in the meta-analysis (4 comparisons, 653 participants). The overall SMD was − 0.65 (− 0.82 to − 0.48), and no between-study inconsistency was found (I2 = 0%). The pre-specified stratified analyses performed with regression models did not influence the variation in the estimates across strata and were not considered relevant.

Association with primary endpoint

Overall, 27 trials (39 comparisons) stated explicitly what the primary endpoint measure was and reported the proportion of participants achieving the primary endpoint. The most common composite primary outcome was the ASAS 20 response criteria (56%), followed by the change in BASDAI (37%). Two studies (7%) used a customized composite outcome (e.g., the overall change in PGA).

In total, 5723 axSpA patients were included in the meta-analysis. The pooled OR for achieving primary endpoint was 3.26 (2.58 to 4.13) in favor of participants receiving experimental intervention compared to participants receiving a control comparator.
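For orientation, the trial-level quantity entering this analysis is the log odds ratio and its standard error, computed from the number of participants achieving the primary endpoint in each arm. The sketch below shows the standard calculation; the counts are hypothetical, and the 0.5 continuity correction for empty cells is an assumption rather than a documented choice of the study.

```python
import math


def log_odds_ratio(events_exp, n_exp, events_ctl, n_ctl):
    """Log odds ratio and its standard error from a 2x2 table."""
    a, b = events_exp, n_exp - events_exp
    c, d = events_ctl, n_ctl - events_ctl
    if 0 in (a, b, c, d):
        # Continuity correction for empty cells (assumed for illustration).
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, se


def odds_ratio_with_ci(log_or, se, z=1.959964):
    """Back-transform a log OR to the OR scale with a 95% CI."""
    return (
        math.exp(log_or),
        math.exp(log_or - z * se),
        math.exp(log_or + z * se),
    )


# Hypothetical trial: 60/100 responders on treatment vs 30/100 on control.
log_or, se = log_odds_ratio(60, 100, 30, 100)
or_value, or_low, or_high = odds_ratio_with_ci(log_or, se)
print(f"OR = {or_value:.2f} ({or_low:.2f} to {or_high:.2f})")
```

Pooling and meta-regression are carried out on the log scale, and the results are exponentiated for reporting as ORs.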

Univariate meta-regression analyses based on all trials (i.e., trials that had a measured composite primary endpoint) indicated that a reduction in pain and APRs and improvements in PF and PGA were significantly associated with increased odds of achieving the primary endpoint, whereas SM, SS, peripheral joints/entheses (PJ/E), and fatigue were not (Table 4). We repeated the meta-regression analysis based on trials reporting all four domains significantly affecting the OR for achieving the primary endpoint. PF, pain, and PGA were still significantly associated with the OR for achieving the primary endpoint, whereas APRs proved non-significant. Multivariable meta-regression analyses showed that PF did not have a statistically significant explanatory effect on achieving the primary outcome when the explanatory core outcome domains pain, PGA, and APRs were added to the model simultaneously. Only reductions in pain and PGA had a statistically significant effect on the OR for achieving the primary outcome, regardless of analysis.

Table 4 Overview of the impact of core outcome domains on the odds ratio (OR) for achieving primary endpoint per trial

Discussion

This meta-research study aimed to assess the effect of interventions for axSpA according to each core domain in the existing ASAS/OMERACT-endorsed core outcome set. The eligible studies reported data only for patients with r-axSpA. The most frequent domains assessed in the included trials were SM and pain, which are considered prominent features for axSpA [76]. Overall outcome reporting was surprisingly good for SM-ARD/physical therapy trials, especially considering that most of the included studies were published prior to implementation of the COS.

The overall reporting for the included DC-ART trials was sparse. Surprisingly, none of the trials measured all the nine proposed domains. Fifteen (30%) of the included studies were published before the COS was suggested by ASAS and endorsed by OMERACT, possibly explaining the lack of measured domains.

We found that all interventions, both non-pharmacological and pharmacological, when compared to control, resulted in an overall statistically significant reduction in pain related to axSpA, SS, fatigue, and APRs and an improvement in PF, SM, and PGA in axSpA trials. Due to our broad eligibility criteria where the type of interventions varied greatly among RCTs, the high between-study heterogeneity observed was not unexpected. However, type of intervention did not result in a significant change in τ2 for all the domains. For the domain “PF,” the overall effect size was moderate regardless of type of intervention. Our meta-analyses provided evidence that interventions in axSpA trials did not reduce the swollen peripheral joint count/enthesitis index (PJ/E) or spinal progression more than placebo. When stratifying by type of intervention, biological treatment seemed to have a larger effect on reducing the number of swollen PJ/E. One should nonetheless be cautious in concluding that biologics are superior to other pharmacological treatments for inflammation in PJ/E, as our meta-analysis included only a limited number of trials. Radiographic progression was measured in only two trials and fully reported in one. Given that most trials spanned 26 weeks or less, it is not surprising that they did not measure radiographic progression. Magnetic resonance imaging (MRI) is an important imaging tool to assess axSpA, especially early in the disease course, before radiographic damage is apparent. Adding MRI to the domain “spine radiographs” could prove useful, as MRI is commonly used in short-term axSpA trials [77]. However, there is currently no consensus on how to monitor treatment response using MRI modalities in axSpA patients [78].

For transparency, we believe all domains and instruments used in trials should be reported. We found that domains and instruments sometimes were used but not reported separately. For example, the domain “fatigue,” which is included in the Bath Ankylosing Spondylitis Disease Activity Index (BASDAI), was rarely reported separately, whereas the domain “spinal stiffness,” also included in BASDAI, was reported in half of all the studies.

In line with previous findings, this meta-epidemiological study found that pain and PGA are important predictors of treatment response in axSpA trials [79], thus emphasizing the value of reporting core domains separately.

Outcome reporting bias (ORB) can affect the quality of evidence within a systematic review and meta-analysis [80]. We found a high suspicion of selective ORB in eight (16%) of the individual included RCTs. In most cases, it was not possible to make a clear judgment on reporting bias due to the lack of published protocols. Where protocols were available, there was no evidence of selective reporting. If a composite outcome (e.g., BASDAI) was reported but no data on any of the individual core outcome measurements (e.g., fatigue) were reported, then we judged ORB as low risk; it might not have been the trialists’ intention to analyze the individual core outcomes separately. If a study reported only some of the outcomes from the composite outcome measurement, then we judged ORB as high risk, as it is likely that all the core outcome measurements were analyzed but some were not reported because of non-significant results. In many of the individual trials, not all of the outcome domains were mentioned, thereby requiring clinical judgment to decide whether the outcome of interest was likely to have been measured for a particular trial.

A limitation of this study was that we did not contact the trialists to determine whether outcomes were measured; many of the studies were published over 15 years ago, and it would have been difficult to locate the trialists. Another limitation is that our results are based on axSpA trials included in Cochrane reviews, and therefore, we did not have control over the literature searches used. However, Cochrane reviews are known for the quality of their searches, and we consider the trials included in our study to be representative and our results to be generalizable. We used SMDs to meta-analyze outcomes involving the same or similar constructs. We did not report absolute changes in the units or percentages of the most common instruments, which clinicians would find easier to understand. However, the SMD is more generalizable and can be interpreted using a general rule of thumb reported by Cohen, in which an SMD of 0.2 represents a small effect, an SMD of 0.5 represents a medium effect, and an SMD of 0.8 or larger represents a large effect [12].

Conclusions

Although all types of axSpA conditions were eligible, the analyses were limited to patients with r-axSpA (ankylosing spondylitis, AS), since none of the eligible studies included patients with non-radiographic axSpA. This could either be perceived as a limitation or simply as a consequence of the axSpA history reflected in the existing Cochrane reviews. Consistent outcome reporting for DC-ART trials was poor. The core domains most strongly associated with achieving the primary endpoint per trial were pain and PGA. Our findings support that PGA and pain give us a more holistic assessment of disease beyond objective measures of spinal inflammation.

Outcome reporting bias and “missing data” could be reduced by implementing the endorsed ASAS/OMERACT COS, thereby improving the precision of results in meta-analyses.