Background

Thresholds of meaningful score change for patient-reported outcomes (PROs) are of crucial importance, particularly when assessing and interpreting treatment benefit. Within-patient meaningful improvement (WPMI) represents the smallest difference in an outcome measure which is considered by patients to be beneficial [1, 2]. As recommended by the United States Food and Drug Administration, appropriate thresholds that indicate clinically meaningful within-patient change should be established a priori via anchor-based methods, using anchors such as the Patient Global Impression of Severity [3]. These thresholds for WPMI can subsequently be used to interpret clinical trial data.

The Short-Form 36 Health Survey version 2 (SF-36v2) and Functional Assessment of Chronic Illness Therapy-Fatigue (FACIT-Fatigue) are both PRO instruments used to quantify key concepts important to patients with various diseases, including rheumatoid arthritis (RA) [4]. Although both are well-established instruments, this study aims to contribute to the ability to interpret results obtained from these instruments by estimating thresholds for WPMI. More recently, the Rheumatoid Arthritis Symptoms and Impact Questionnaire (RASIQ) was developed to specifically evaluate the symptoms of RA and their impact on patients [5]. Establishing interpretation thresholds of the score change for both new and established PRO instruments furthers understanding of results obtained from these PROs.

This post-hoc analysis used data from the Phase 2 BAROQUE (NCT02504671) [6] and RENAISSANCE (NCT02799472) [7] otilimab trials to determine the WPMI thresholds for SF-36v2, FACIT-Fatigue, and RASIQ among patients with RA. This study extends prior work by comparing previously established interpretation thresholds for SF-36v2 and FACIT-Fatigue [8,9,10] to those obtained using data from the otilimab trials, and establishing WPMI thresholds for RASIQ.

Methods

WPMI thresholds for SF-36v2, FACIT-Fatigue, and RASIQ were established using anchor-based methods, with supportive distribution-based methods and measures of accuracy (sensitivity and specificity) used to further triangulate across the estimates obtained from different anchors. Cumulative distribution function (CDF) plots were also generated to illustrate how well anchor-based change categories were separated across the entire range of RASIQ scale change scores.

Survey content and scoring

The SF-36v2 is a 36-item, self-report survey of functional health and well-being that is scored as two component summary scores (physical and mental health) and as eight domain scores: physical functioning (PF), role limitations due to physical health (RP), bodily pain (BP), general health perceptions (GH), vitality (VT), social functioning (SF), role limitations due to emotional problems (RE), and mental health (MH) [11].

For the eight domain scores, results are presented using a score range from 0 (worst possible health) to 100 (best possible health). Additional File 1 reports results for norm-based scores (NBS), which standardize scale and component scores using the means and standard deviations (SD) from a US general population normative sample [11]. The Physical Component Summary (PCS) and Mental Component Summary (MCS) scores are always based on NBS, using a mean of 50 and a SD of 10 in the US adult general population, with higher scores indicating better health.
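As an illustrative sketch of the norm-based transformation described above, a 0–100 domain score $X$ can be standardized against the normative sample and rescaled to a mean of 50 and an SD of 10. The symbols ${\mu}_{norm}$ and ${\sigma}_{norm}$ are placeholders for the normative mean and SD of that domain, which are not reproduced here; the component summary scores involve additional weighting of the standardized domain scores as specified by the developers [11], which this sketch does not cover.

$$ NBS = 50 + 10 \times \frac{X - {\mu}_{norm}}{{\sigma}_{norm}} $$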

The 13-item FACIT-Fatigue questionnaire assesses self-reported fatigue and its impact upon daily activities and function over the past 7 days; item responses are added with equal weight to obtain the total score which ranges from 0 (most fatigue) to 52 (least fatigue) [12].

RASIQ is a novel measure comprising 16 items across three domains (Joint Pain [JP], Joint Stiffness [JS], and Impact [IM]). Scores from each item are summed and transformed to a metric ranging from 0 (least pain/stiffness/impact) to 100 (most pain/stiffness/impact) [5].

Data sources

BAROQUE [6] was a randomized, Phase 2b, dose-adaptive, multi-center, double-blind, placebo-controlled trial which assessed the efficacy of the anti-granulocyte-macrophage colony-stimulating factor monoclonal antibody, otilimab, in patients with active, moderate-to-severe RA despite treatment with methotrexate. RENAISSANCE [7] was a Phase 2a, multi-center, double-blind, placebo-controlled trial which evaluated change from baseline in various exploratory biomarkers among patients with RA treated with otilimab. While both trials included the RASIQ, the SF-36v2 and FACIT-Fatigue were used only in the BAROQUE trial, where they were completed at baseline, Weeks 4, 12, 24, 36, and 52, and follow-up. The RASIQ was completed at screening, baseline, Weeks 1, 6, and 12, and follow-up in the RENAISSANCE study, and at Weeks 1, 12, 24, 36, and 52, and follow-up in the BAROQUE trial.

Data from baseline to Week 24 in the BAROQUE trial were used in the SF-36v2 and FACIT-Fatigue analyses. Pooled data from baseline to Week 12 in the BAROQUE and RENAISSANCE trials were used in the analyses of RASIQ.

Anchor items

The general anchor for all SF-36v2 scales was the Patient’s Global Assessment of Disease Activity (PtGA), with scores ranging from 0 (very well) to 100 (very poor). In addition, Patient’s Assessment of Arthritis Pain (PAIN; scores range from 0 [no pain] to 100 [most severe pain]) was used as an anchor for the BP scale. One item from the FACIT-Fatigue questionnaire, AN5 (I have energy; Not at all / A little bit / Somewhat / Quite a bit / Very much), was used as an anchor for the VT scale.

The PtGA item and two items from the SF-36v2 were used as anchors for FACIT-Fatigue. The first SF-36v2 item assessed the Patient’s Global Impression of Status (PGIS; In general, would you say your health is: Excellent / Very good / Good / Fair / Poor), and the second item focused on fatigue (How much of the time during the past 4 weeks did you feel worn out? Not at all / A little bit / Somewhat / Quite a bit / Very much).

WPMI analyses of RASIQ were based on the SF-36 PGIS, PtGA, PAIN, and additional items on pain and overall impact. The SF-36 PGIS and PtGA were used as anchors for all RASIQ scales. The PAIN and one SF-36v2 item focused on pain (How much bodily pain have you had during the past 4 weeks? None / Very mild / Mild / Moderate / Severe / Very severe) were used as additional anchors for the JP scale, and two FACIT-Fatigue items (I feel tired and I feel listless [washed out]: response scale for both; Not at all / A little bit / Somewhat / Quite a bit / Very much) were used as additional anchors for the IM scale. Full details of the anchors used are shown in Table 1.

Table 1 Anchors used for SF-36v2, FACIT-Fatigue, and RASIQ

For categorical anchors, a one-point (or one-category) improvement was deemed to be associated with the smallest meaningful change indicating improvement. The categorizations of change groups for anchors that used a continuous metric were based on results from studies that established thresholds for within-person change for the same measure and among a sample of patients with RA [13]. For PtGA, a value of -18 was used, and for PAIN, a value of -20 was used.
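The categorization rules above can be sketched as follows. This is a minimal illustration rather than the analysis code used in the study: the function names are hypothetical, the thresholds are treated as inclusive (an assumption), and the improvement group for categorical anchors is taken to be exactly a one-category improvement, with the coding direction passed in explicitly.

```python
# Thresholds quoted in the text for change from baseline on the continuous anchors.
CONTINUOUS_THRESHOLDS = {"PtGA": -18.0, "PAIN": -20.0}


def improved_on_continuous_anchor(anchor, baseline, follow_up):
    """Binary split of a continuous anchor: True when the change reaches the threshold.

    Assumption: the threshold is inclusive, so a PtGA change of exactly -18 counts.
    """
    change = follow_up - baseline  # negative change = improvement on PtGA and PAIN
    return change <= CONTINUOUS_THRESHOLDS[anchor]


def is_minimal_improvement(baseline_category, follow_up_category, higher_is_better=False):
    """True when an ordinal anchor improved by exactly one category.

    Assumption: categories are coded as consecutive integers; `higher_is_better`
    states whether a larger code means better health for this anchor.
    """
    change = follow_up_category - baseline_category
    return change == (1 if higher_is_better else -1)
```

For example, a patient moving from "Fair" (coded 4) to "Good" (coded 3) on a lower-is-better coding of the general health item would fall in the one-category improvement group, whereas a move from "Fair" to "Very good" would not.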

Statistical analysis

The association between change in each PRO score and the proposed anchors was evaluated using the Spearman correlation coefficient, with a recommended absolute value of at least 0.30 indicating adequacy of the anchor [14, 15]. WPMI was estimated as the mean score change from baseline to Week 12 or 24 in the group associated with the smallest meaningful improvement in each corresponding anchor. Effect sizes were calculated using the standardized response mean (SRM) to better compare the magnitude of the mean change scores, using:

$$ SRM=\frac{{\bar{X}}_{change}}{{SD}_{change}} $$

where the numerator consists of the mean of the change scores and the denominator is the SD of the same change score.
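The anchor adequacy check, the mean-change WPMI estimate, and the SRM can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the study's code: the function names are hypothetical, the SD uses the sample (n - 1) denominator, and the example change scores are invented.

```python
from math import sqrt
from statistics import mean, stdev


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)


def anchor_is_adequate(rho, minimum=0.30):
    """Apply the 0.30 criterion to the absolute value of the correlation."""
    return abs(rho) >= minimum


def wpmi_and_srm(change_scores):
    """Return (mean change, SRM) for the change scores of one anchor group."""
    if len(change_scores) < 2:
        raise ValueError("need at least two change scores to estimate an SD")
    mean_change = mean(change_scores)
    sd_change = stdev(change_scores)  # sample SD (n - 1 denominator), an assumption
    return mean_change, mean_change / sd_change


# Invented change scores for patients in the smallest-improvement anchor group.
example_changes = [10.0, 20.0, 15.0, 25.0, 5.0]
wpmi, srm = wpmi_and_srm(example_changes)
```

With these invented values, the mean change is 15.0 and the SRM is approximately 1.90.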

The reliable change index (RCI) was used to identify change that can be considered beyond measurement error [16]. First, the standard error of the measurement (SEM) was calculated using:

$$ SEM={SD}_{baseline} \times \sqrt{1-reliability} $$

Reliability was estimated using Cronbach’s alpha [17], a measure based on inter-item correlations. As a sensitivity analysis, reliability was also estimated using the omega coefficient [18] and the greatest lower bound [19]. These analyses gave very similar results. Next, the RCI was calculated using:

$$ RCI= \sqrt{2 } \times SEM \times 1.282$$

In the equation above, 1.282 is taken from the standardized normal distribution; it represents the half-width of the 80% confidence interval, which is a reasonable criterion for individual respondents, providing an appropriate balance between the risks of falsely identifying change and overlooking true change [11]. Half of a standard deviation (based on baseline scores) is also reported for completeness, as this has been advocated by researchers in the field [20].
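The distribution-based quantities defined above (SEM, RCI, and 0.5 SD) can be computed as in the following sketch. The function names and the example inputs (a baseline SD of 20 and a reliability of 0.91) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from math import sqrt

Z_80 = 1.282  # multiplier for the half-width of the 80% confidence interval


def sem(sd_baseline, reliability):
    """Standard error of measurement: SD at baseline times sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd_baseline * sqrt(1.0 - reliability)


def rci(sd_baseline, reliability, z=Z_80):
    """Reliable change index: sqrt(2) x SEM x z."""
    return sqrt(2.0) * sem(sd_baseline, reliability) * z


def half_sd(sd_baseline):
    """Half of the baseline standard deviation."""
    return 0.5 * sd_baseline


# Hypothetical inputs, not values from the trials.
example_sem = sem(20.0, 0.91)
example_rci = rci(20.0, 0.91)
example_half_sd = half_sd(20.0)
```

With these hypothetical inputs, the SEM is approximately 6.0, the RCI is approximately 10.9, and 0.5 SD is 10.0.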

Sensitivity and specificity were used as measures of accuracy to characterize and compare the various anchor-based estimates. Sensitivity indicates the likelihood of correctly identifying a truly improved individual, while specificity indicates the likelihood that an individual who has not improved is correctly classified as such. For the current analyses, the anchor was used as the gold standard, while the PRO measure was used as the classification or ‘test’ variable.
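The accuracy calculation can be illustrated with a short sketch in which the anchor supplies the reference classification and the PRO change score, compared against a candidate threshold, supplies the test classification. The function name, the inclusive treatment of the threshold, the treatment of every patient outside the anchor's improvement group as not improved, and the example data are all assumptions made for illustration.

```python
def sensitivity_specificity(change_scores, anchor_improved, threshold, higher_is_better=True):
    """Compare threshold-based classification of PRO change with the anchor.

    `anchor_improved` holds the anchor (gold standard) classification per patient.
    `higher_is_better` is False for scales where a lower score is better.
    """
    if len(change_scores) != len(anchor_improved):
        raise ValueError("inputs must have the same length")
    tp = fn = tn = fp = 0
    for change, truly_improved in zip(change_scores, anchor_improved):
        if higher_is_better:
            flagged = change >= threshold
        else:
            flagged = change <= threshold
        if truly_improved and flagged:
            tp += 1
        elif truly_improved:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both anchor groups must be non-empty")
    return tp / (tp + fn), tn / (tn + fp)


# Invented change scores on a lower-is-better scale, with a threshold of -24.0.
changes = [-30.0, -25.0, -10.0, -40.0, -5.0, -26.0, 0.0, -24.0]
improved = [True, True, True, True, False, False, False, False]
sens, spec = sensitivity_specificity(changes, improved, -24.0, higher_is_better=False)
```

In this invented example, three of the four anchor-improved patients reach the threshold (sensitivity 0.75), and two of the four remaining patients do not (specificity 0.50).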

The CDF plots of change scores were used to better understand the separation between anchor-based change groups across the entire range of observed RASIQ change scores. CDF plots were obtained for each RASIQ scale using the respective anchors, focusing on the anchor category where the patients are defined by the anchor measure as having experienced meaningful change. A consistent separation across the score range between the curve for this category of change and those of adjacent groups indicates support for the anchor.
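The values underlying such CDF plots can be obtained by computing an empirical CDF of the change scores separately for each anchor-based change group. The sketch below uses only the standard library and leaves plotting to whichever tool is preferred; the function names, group labels, and data are hypothetical.

```python
from collections import defaultdict


def ecdf(values):
    """Return the sorted values and the cumulative proportion at or below each."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered, [(i + 1) / n for i in range(n)]


def ecdf_by_group(change_scores, groups):
    """Compute one empirical CDF per anchor-based change group."""
    by_group = defaultdict(list)
    for change, group in zip(change_scores, groups):
        by_group[group].append(change)
    return {group: ecdf(vals) for group, vals in by_group.items()}


# Invented change scores and anchor groups, for illustration only.
curves = ecdf_by_group(
    [-30.0, -10.0, -20.0, 0.0],
    ["improved", "no change", "improved", "no change"],
)
```

Plotting each group's sorted values against its cumulative proportions as a step function gives one curve per change group; curves that stay apart across the score range correspond to the separation described above.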

Results

Estimated WPMIs of SF-36v2

The correlations between change in PtGA and the SF-36v2 PF, RP, BP, VT, SF, and PCS change scores ranged from -0.30 to -0.48, meeting the criterion of at least 0.30 in absolute value, as shown in Table 1. For the four remaining SF-36v2 scales (GH, RE, MH, and MCS), correlations ranged between -0.22 and -0.29, indicating that the PtGA is not an empirically adequate anchor for these scales.

Anchor-based WPMI values for the eight SF-36v2 0–100 domain scores ranged between 13.6 for the GH scale (with an SRM of 0.87) and 26.6 for the BP scale (with an SRM of 1.73) (Table 2). PCS and MCS had WPMI estimates based on NBS of 9.7 and 7.6, and SRM of 1.47 and 0.70, respectively. The accuracy measures of these threshold values for identifying meaningful improvement indicated that for most scales, the thresholds have better sensitivity (0.66 to 0.87) than specificity (0.43 to 0.58).

Table 2 Anchor- and distribution-based estimates for the SF-36v2, FACIT-Fatigue, and RASIQ

RCI-based estimates were 12.7 for PF, 10.7 for RP, 11.1 for BP, 16.7 for GH, 13.1 for VT, 19.7 for SF, 12.1 for RE, 13.3 for MH, 4.1 for PCS, and 6.0 for MCS (Table 2). Estimates based on 0.5 SD were 9.8, 9.1, 7.6, 8.2, 8.3, 10.5, 12.4, 9.5, 3.4, and 5.4, respectively.

WPMI estimates based on mean change were generally similar, although slightly smaller in some cases (e.g., for the MH scale), to those provided by the cut point associated with the best balance between sensitivity and specificity (Additional File 2, Supplementary Table S1). CDF curves generally mirrored correlation values, with PtGA-based curves being less separated for GH, SF, and RE domains (Additional File 2, Supplementary Figure S1); similarly, PtGA-based CDF curves for PCS were more clearly separated when compared to MCS (Additional File 2, Supplementary Figure S2).

Estimated WPMIs of FACIT-Fatigue

Anchor-based WPMI estimates ranged from 9.7 to 11.3 (SRM 0.99–1.15; Table 2). RCI generated a value of 4.9. The cut point associated with the best sensitivity/specificity balance was slightly smaller than the values obtained with mean change analyses (Additional File 2, Supplementary Table S2). A clear separation between all CDF curves was observed (Additional File 2, Supplementary Figure S3).

Estimated WPMIs of RASIQ

Joint pain scale

Analysis of mean change scores for a one-point improvement in the SF-36 PGIS indicated that a meaningful improvement in the RASIQ’s JP scale was equal to a 24.0-point reduction in score (Table 2). The estimate for the BP anchor (BP01) was -21.7, while the two anchors that are based on a binary categorization of a continuous scale (PAIN and PtGA) provided WPMI estimates that were larger in absolute value (-32.7 and -31.0, respectively).

RCI generated a value of -6.8. A clear separation between the CDF curves was observed (Additional File 2, Supplementary Figure S4).

For the values found in the anchor-based analyses (Table 2), sensitivity (range: 0.73–0.94) was higher than specificity (range: 0.49–0.58). For example, with SF-36 PGIS as the anchor, at a threshold of -24.0, the sensitivity was 0.73 while the specificity was 0.56, indicating better performance at correctly classifying patients who have improved than those who have not improved.

Joint stiffness scale

Analysis of mean change scores based on an improvement of one point or better in SF-36 PGIS indicated that a meaningful improvement in the RASIQ JS scale was equal to a 23.3-point reduction in score (Table 2). When using PtGA as the anchor, the estimate was approximately 3 points higher (-26.1) in absolute value. Estimates based on RCI (-13.1) were approximately half of those obtained under the mean change score analysis, with the estimate based on 0.5 SD equal to -9.1.

A clear separation between the CDF curves was observed (Additional File 2, Supplementary Figure S5).

Impact scale

Analysis of mean change scores based on an improvement of at least one point in SF-36 PGIS indicated that a meaningful improvement in the RASIQ’s Impact scale translated to a 21.0-point reduction in score, which was nearly identical to the estimate obtained under the PtGA anchor (Table 2). The remaining two anchors, AN2 (I feel tired) and AN1 (I feel listless), resulted in estimates that were slightly smaller (in absolute value) at -17.4 and -17.8, respectively. The 0.5 SD and RCI criteria resulted in values of -7.8 and -12.8, respectively.

The CDF plots indicate that the curves obtained for each change group were generally separated, except for the plot corresponding to the AN1 anchor (Additional File 2, Supplementary Figure S6).

At a threshold of -21.0, the sensitivity and specificity were 0.89/0.48 when SF-36 PGIS was the anchor; the estimate of -21.1 associated with the PtGA anchor resulted in values of sensitivity and specificity equal to 0.93 and 0.44, respectively. For the smaller WPMI estimates based on the two FACIT-Fatigue items (AN2 and AN1), sensitivity values were slightly lower (0.85/0.79) and specificity slightly higher (0.57/0.58).

Discussion

Understanding the thresholds for within-patient meaningful change scores for PRO instruments is important for assessing and interpreting benefits of a treatment. In this study, we sought to determine the WPMI thresholds for SF-36v2, FACIT-Fatigue, and RASIQ among patients with RA. As the RASIQ is a new questionnaire that was designed specifically for RA, this research will allow increased use of the measure in the future.

For SF-36v2 NBS scores, most of the WPMI estimates obtained in the current study (using mean change score in the anchor category) were similar to or up to 2 times greater than those recommended by the developers, which were derived from the US general population (average number of chronic conditions reported was 2.6 [SD = 2.5]) [11]. It should be noted that the thresholds for within-individual change recommended by the developers were based on SEM around the change score (similar to RCI) rather than confidence intervals for observed change based on patient-rated anchors. WPMI estimates for SF-36v2 domains based on 0–100 scores were substantially greater (by a factor of approximately 3 to 5) than those that have been applied to RA trial data, which identify meaningful within-patient change using a change score of 5 points for the eight SF-36v2 scales [21, 22]. For FACIT-Fatigue, the WPMI estimates ranged between 9.7 and 11.3; again, these estimates are higher than those used in previous studies with patients with RA [22]. Two factors should be noted as likely contributors to overestimation of WPMI. Firstly, the operationalization of PtGA and PAIN (i.e., their dichotomization) did not distinguish between a large and a small improvement in health status; however, the analysis of mean change assumes a category of small but meaningful change. In addition, simulation studies have shown that mean change analyses often overestimate the threshold for meaningful change [23].

Factors beyond methodological aspects of the analyses should be noted as potential drivers of the differences between current and previous results. Earlier studies have used different methods/anchors, whereas the anchors used in this study were specific for patients with moderate/severe RA. Patient demographic and clinical characteristics can also influence WPMI estimates. In addition, the commonly used 5-point thresholds for 0–100 scores of the eight domains of the SF-36v2 and 2.5-point thresholds for the two summary measures (PCS and MCS), as well as those recommended for NBS, were established some time ago and have not been frequently re-evaluated, particularly in the RA patient population. Over time, meaningful improvement scores may have changed with the improvement of treatments, more effective patient care, and increased patient awareness of disease management. A likely driver behind using the 2.5- and 5-point thresholds is the metric underlying these scores, rather than empirical findings based on analyses similar to those carried out in the current study. NBS scores are set to have a mean of 50 and a SD of 10 (based on the US general population); 0.5 and 0.25 of a 10-point SD are 5 points and 2.5 points, respectively, which have been common metrics [24, 25].

Based on available anchors, of which SF-36 PGIS was considered the primary anchor, our analyses for RASIQ indicate that a change between approximately -33 and -22 points in the JP scale score (range: 0–100) could be interpreted as being meaningful for patients; for the JS scale (range: 0–100), this range was approximately -24 to -27, while for the IM scale (range: 0–100), the range of change scores was approximately -21 to -17. For all three scales, distribution-based results indicated that the changes within these ranges were well beyond error that would occur by chance in the measurement process. Overall, anchor-based estimates were associated with high values for sensitivity, indicating that the WPMI estimates were good at identifying patients who improved; values for specificity were low, indicating that these thresholds may have classified as improved many patients who were not “truly” improved. Further studies and/or assessment with other measures is therefore warranted.

Limitations

Only anchors that were included in the two Phase 2 trials were available in the current study. These anchors were not specifically developed for the purposes of deriving WPMI thresholds and did not include patients’ direct assessment of change. As a result, for some SF-36v2 domain scales, the anchor used was not sufficiently correlated with the scale for which it was intended to serve as an anchor. RASIQ is a novel PRO instrument, hence there is limited published literature against which our findings can be compared. Because trial assessments were too far apart, we calculated the SEM using measures of internal consistency reliability across all PROs, which is a further shortcoming given that some researchers would recommend that SEM be calculated from a measure of test-retest reliability. For all three PRO instruments, our analyses were limited to estimation of thresholds related to improvement. Further work is needed to estimate interpretation thresholds that indicate decline and worsening of symptoms, to confirm the values derived in the current study, and to allow exploration of the potential non-linearity across score distributions (the latter of which was not possible due to insufficient sample size).

Conclusions

This study derived WPMI thresholds for SF-36v2, FACIT-Fatigue, and RASIQ, using multiple anchors. Derivation of WPMI thresholds for these PRO instruments will enable their broader use to assist with evaluation and interpretation of treatment benefit in future RA studies.