Background

The responsiveness of a health-related functional state is an important issue in arthroplasty surgery. Responsiveness is the ability of an instrument to detect clinically significant change in health status and as such reflects its impact on clinical practice over time [1,2,3]. It is well recognized that measurement properties can vary according to the study population of interest. This is particularly true of the generic measures, especially those measuring responsiveness. The decision to use a generic or disease-specific instrument to detect responsiveness will also depend on the study design, objectives, and evaluation of cost-effectiveness [4]. Generic health status measures seek a broad perspective that is not specifically related to the restricted scope of the health-related functional status of a particular disease. Generic measures allow the comparison of health status across different diseases and interventions [4, 5].

Assessing outcomes of hip and knee replacement surgery with both generic and specific measures is enabled by the EQ-5D-3L, the Oxford Hip Score (OHS), and the Oxford Knee Score (OKS). The EQ-5D is a well-known and widely used generic patient-reported outcome questionnaire [6]. The current UK version of EQ-5D-3L was introduced in 1997 as a generic measure of health for clinical and economic assessment [7]. It was designed to describe and value health by providing a single summary index-based value (utility; −0.59 to 1) representing the overall health-related quality of life by quantifying a preference for the individual’s health state [8]. The questionnaire consists of a self-reported descriptive system that records three levels of health problems (no/some/extreme problems) on each of five dimensions: mobility (i.e. problems in walking), self-care (i.e. problems with self-care), usual activities (e.g. work, study, housework, family, or leisure activities), pain/discomfort, and anxiety/depression.

The OHS and the OKS focus on the disease being studied, allowing greater sensitivity to intervention-related change compared to generic measures [4, 9]. The OHS and the OKS each consist of 12 Likert-type response items, which relate to pain and disability experienced over the past 4 weeks [10, 11]. Scores from each item are summed (responses coded from ‘None’ = 4 to ‘Severe’ = 0), providing a range of 0 to 48, with a higher score indicating better health status [12, 13].

Husted et al. [14] defined internal responsiveness as the ability of a measure to change over a pre-specified time frame. External responsiveness was defined as the extent to which changes in a measure relate to changes in other measures of health status; that is, it reflects the relationship between change in the measure and change in the external standard [14]. External responsiveness between independent groups and cross-validation between measuring systems were explored in previous studies. However, those studies did not specify the internal responsiveness for a single group, so the ability of these instruments to detect change needs to be examined using paired group-specific statistics. The aim of this study is to evaluate the paired data-specific responsiveness of the EQ-5D-3L, the OHS, and the OKS using various analytic methods, and to discuss which analytic methods and instruments should be used for the reporting system in arthroplasty surgery.

Methods

Data sources

Responsiveness was assessed for the population in the NHS patient-reported outcome measures (PROMs) data who had undergone hip or knee surgery in the UK (ref: NIC-392690-F7H2Q). Follow-up was measured 6 months after the hip or knee surgery. The NHS PROMs data linked to Hospital Episode Statistics (HES) (2009–2015) recorded the pre-operative and the 6-month post-operative PROMs outcomes. The outcomes include the EQ-5D-3L and the respective hip and knee Oxford scores for all individuals who underwent hip and knee surgery in England [15].

The inclusion criteria were patients who had not received revision (primary surgery only; Footnote 1) and who had not had previous surgery, using the ‘Q1_PREVIOUS_SURGERY’ question (N = 575,980). In addition, patients who completed both pre- and post-questionnaires were included, using the ‘Q1 and Q2 Complete’ questions (N = 443,262) [16]. For hip surgery, patients who submitted specific data were included for both the pre- and post-operative Oxford questionnaires to derive scores with sufficient procedure, using the ‘HR Q1 and HR Q2 Score Complete’ questions (N = 209,761) and the ‘Q1 and Q2 EQ-5D Health Scale Complete Indicators’ (N = 181,424). The same approach was applied for those undergoing knee surgeries (N = 191,379).

Outcome and predictor variables

The change scores (the difference between the post- and pre-operative scores) of the EQ-5D-3L, the OHS and the OKS were used as the main outcomes, respectively. The pre-operative EQ-5D-3L, OHS, and OKS scores were used as the main predictor variables for the change scores. Patients' age, gender and important clinical exposures, namely, 12 individual comorbidities (heart disease, high blood pressure, stroke, circulation, lung disease, diabetes, kidney, nervous system, liver disease, cancer, depression, arthritis) were used as other prognostic variables.

Transition question

The MCID (minimal clinically important difference), which can be linked to the improvement concept, was calculated using the patients’ self-assessment of the 6-month post-operative outcomes relative to the pre-operative state. The MCID allows an estimation of the probability of a relevant improvement in the instrument following an intervention [17]. The assumption of the MCID is that the mean change score needed to obtain a medium or large effect size is clinically meaningful [18]. Clinically meaningful refers to a change that indicates the efficacy of the intervention in domains of a health-related functional status instrument [4]. The MCID can be calculated at the group level (using the anchor-based transition, in which the concept of ‘minimal importance’ is explicitly incorporated) and also at the distribution-based individual level (using the standardized response mean (SRM)-applied paired data-specific MCID) [17, 19, 20]. In this paper, a combined approach was used: firstly, the SRM-applied paired data-specific MCID was used to estimate the threshold for improvement, and secondly, patients’ perception of improvement was estimated by the level of the transition in the multivariate regression models (Table 1).

Table 1 The patient-reported success of the surgery in the Oxford hip (or knee) score questionnaire

The NHS PROMs contains the post-operative satisfaction and success questions, and the success question was applied in this study since it is considered more objective than the satisfaction question asking ‘How would you describe the results of your operation? Excellent/Very good/Good/Fair/Poor’.

For the paired data-specific univariate responsiveness, the SRM, the standardized effect size (SES), and the responsiveness index (RI) were calculated.

Univariate responsiveness measures

In the present study, internal responsiveness was investigated, focusing on the internal standard of an individual using the pre- and post-operative (paired) data, and compared as a psychometric property of the EQ-5D-3L, the OHS, and the OKS. The internal responsiveness was assessed by calculating different formulae of responsiveness for a critical assessment: the SRM, the SES, and the RI as the univariate statistics.

SRM for the paired data [4, 20,21,22]

$$ \mathrm{The}\ \mathrm{paired}\ \mathrm{data}-\mathrm{specific}\ \mathrm{SRM}:\frac{\left({\mathrm{Mean}}_{\mathrm{change}\ \mathrm{score}}/{\mathrm{SD}}_{\mathrm{change}\ \mathrm{score}}\right)}{\surd 2\times \surd \left(1-r\right)} $$
(1)

where r is a correlation coefficient between the pre- and post-operative scores [4].

The pre- and post-operative data-specific SRM starts from the ratio between the mean change score and the variability (SD) of that change score within the same group (mean change score / SD of the change score). This ratio is then standardized (i.e. divided) by the value √2 × √(1 − r) (as large as would be the case were they independent) [4, 21]. The SRM for the independent data is simply the mean change score / SD of the change score between the two groups [20].
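As a minimal sketch of Eq. 1, the function below computes the paired data-specific SRM from matched pre- and post-operative scores. The function name `paired_srm` and the choice to pass r in as an argument are assumptions made for illustration; the study reports r as Spearman’s rank correlation by transition level, and r must be below 1 for the divisor to be non-zero.

```python
from math import sqrt
from statistics import mean, stdev

def paired_srm(pre, post, r):
    """Paired data-specific SRM as written in Eq. 1 (illustrative helper).

    pre, post: matched pre- and post-operative scores for the same patients.
    r: correlation between the pre- and post-operative scores (supplied by the caller).
    """
    change = [b - a for a, b in zip(pre, post)]   # post minus pre, per patient
    unadjusted = mean(change) / stdev(change)     # mean change / SD of change
    return unadjusted / (sqrt(2) * sqrt(1 - r))   # divide by sqrt(2) * sqrt(1 - r)
```

With r = 0.5 the divisor equals 1, so the result coincides with the unadjusted ratio of mean change to SD of change.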

SES for the paired data

The SES was calculated using the patients’ self-assessed transition level, i.e. much better, a little better, about the same, a little worse, and much worse [4].

$$ \mathrm{Standardized}\ \mathrm{Effect}\ \mathrm{Size}\ \left(\mathrm{SES}\right)=\frac{{\mathrm{Mean}}_{\mathrm{pre}-\mathrm{op}.\mathrm{score}}-{\mathrm{Mean}}_{\mathrm{post}-\mathrm{op}.\mathrm{score}\ \left(\mathrm{of}\ \mathrm{the}\ \mathrm{success}\ \mathrm{level}\right)}}{{\mathrm{SD}}_{\mathrm{pre}-\mathrm{op}.\mathrm{score}\ \left(\mathrm{of}\ \mathrm{the}\ \mathrm{success}\ \mathrm{level}\right)}} $$
(2)
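A small sketch of Eq. 2 follows, assuming the scores passed in already belong to one success (transition) level. The name `ses` is hypothetical, and the sign follows the equation exactly as printed (pre-operative mean minus post-operative mean), so an improvement on a higher-is-better scale yields a negative value whose magnitude is the effect size.

```python
from statistics import mean, stdev

def ses(pre, post):
    """Standardized effect size as printed in Eq. 2 (illustrative helper).

    pre, post: pre- and post-operative scores of patients in one transition level.
    The denominator is the SD of the pre-operative scores.
    """
    return (mean(pre) - mean(post)) / stdev(pre)
```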

RI for the paired data

The RI was proposed as the ratio of average change produced by a treatment to the between subject variability of difference scores in stable subjects [23]. The RI was calculated using the patients’ self-assessed transition-based (i.e. a little better vs. about the same) MCID, assuming the patients’ perception of change over time is meaningful [4, 24].

$$ \mathrm{Responsiveness}\ \mathrm{Index}\ \left(\mathrm{RI}\right)=\frac{{\mathrm{MCID}}_{\mathrm{anchor}-\mathrm{based}}}{{\mathrm{SD}}_{\mathrm{change}\ \mathrm{score}\ \left(\mathrm{of}\ \mathrm{the}\ \mathrm{stable}\ \mathrm{level}\right)}} $$
(3)

where the MCID here is defined according to a criterion (i.e. the difference in change score between those who perceived themselves a little better vs. about the same).
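Eq. 3 can be sketched as below, under the assumption that the anchor-based MCID is the difference in mean change score between the ‘a little better’ and ‘about the same’ groups, and that the stable level is the ‘about the same’ group. The function name and argument names are hypothetical.

```python
from statistics import mean, stdev

def responsiveness_index(change_little_better, change_about_same):
    """Responsiveness index following Eq. 3 (illustrative helper).

    change_little_better: change scores of patients answering 'a little better'.
    change_about_same: change scores of patients answering 'about the same'
                       (treated here as the stable group).
    """
    mcid = mean(change_little_better) - mean(change_about_same)  # anchor-based MCID
    return mcid / stdev(change_about_same)                       # SD of change in stable group
```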

In addition to the univariate responsiveness measures, the patients’ perception of improvement was estimated with a modelling approach, using Box-Cox regressions based on log-likelihood and adjusting responsiveness for patient characteristics, including age, gender, and 12 individual comorbidities. For a robust analytic approach, the paired data-specific MCID was defined as the threshold for improvement in the models.

Multivariate responsiveness measures

The threshold for improvement with the MCID for the paired data

Cohen introduced the matched pairs effect size [21], which was later renamed the standardized response mean (SRM) by Liang et al. [4, 20].

The paired data-specific MCID (i.e. the mean change score) applied the SRM [Eq. 1] as a desired effect size [25]:

$$ \mathrm{The}\ \mathrm{paired}\ \mathrm{data}-\mathrm{specific}\ \mathrm{SRM}\ \left[\mathrm{Eq}\ .1\right]\times \surd 2\times \surd \left(1-r\right)\times {\mathrm{SD}}_{\mathrm{change}\ \mathrm{score}} $$
(4)

The independent data MCID, using Cohen’s medium (0.5) or large (0.8) effect size for the independent samples, is Cohen’s d (i.e. 0.5 or 0.8) × √2 × √(1 − r) × SD of the change score.
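The two MCID definitions share one formula and differ only in the effect size plugged in. The sketch below uses a hypothetical helper, `mcid_from_effect_size`, and purely illustrative numbers (not values from the study) to show that inserting the paired SRM of Eq. 1 returns the mean change score, while inserting Cohen’s 0.5 gives the independent-data threshold.

```python
from math import sqrt

def mcid_from_effect_size(effect_size, r, sd_change):
    """Threshold = effect size x sqrt(2) x sqrt(1 - r) x SD of the change score."""
    return effect_size * sqrt(2) * sqrt(1 - r) * sd_change

# Illustrative inputs only (assumed, not taken from the paper).
mean_change, sd_change, r = 6.0, 4.0, 0.3

# Paired SRM as in Eq. 1, then Eq. 4: the factors cancel and give back the mean change.
srm_paired = (mean_change / sd_change) / (sqrt(2) * sqrt(1 - r))
mcid_paired = mcid_from_effect_size(srm_paired, r, sd_change)

# Independent-data version with Cohen's medium effect size.
mcid_medium = mcid_from_effect_size(0.5, r, sd_change)
```

Because the paired SRM is generally larger than 0.5 after surgery, the paired threshold in this sketch exceeds the Cohen-based one.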

Multivariate responsiveness using the regression models

The percentage improvement based on the paired data-specific MCID [Eq. 4] was examined as multivariate responsiveness of the EQ-5D-3L, the OHS and the OKS to examine which instrument is sensitive to detect the changes of improvement for the paired data. The result was additionally examined by the patients' self-assessed transition level, i.e. much better, a little better, about the same, a little worse, and much worse. The observed and estimated percentage improvements were examined separately where regression approaches were applied, adjusting patient baseline covariates such as age, gender, and comorbidities. Adjusting the covariates is one of the strengths in comparison to the responsiveness statistics described in the previous sections. The 3rd and the 2nd degree Box-Cox regressions based on log-likelihood were fitted to estimate the patients’ perception of improvement. The impact of baseline covariates, i.e. age (as a continuous variable), gender, and individual comorbidities, were examined in total and by the transition level population (Fig. 1).
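The observed percentage improvement can be illustrated with a short sketch. It assumes that a patient counts as improved when the change score reaches the threshold (for example the paired data-specific MCID); that reading, the function name, and the inclusive comparison are assumptions for illustration rather than the study’s stated rule.

```python
def percent_improved(changes, threshold):
    """Share (in %) of patients whose change score is at or above the threshold.

    changes: per-patient change scores (post minus pre).
    threshold: improvement threshold, e.g. an MCID (assumed inclusive).
    """
    hits = sum(1 for c in changes if c >= threshold)
    return 100.0 * hits / len(changes)
```

Applying the same function within each transition level would give the by-level percentages.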

Fig. 1

The OHS and EQ-5D-3L – total population (1, 3) and the transition level (2, 4). Fitted 3rd degree Box-Cox regression lines 1 for the OHS total population and 2 by the patients’ self-assessed transition level. The 2nd degree Box-Cox regression estimates 3 for the EQ-5D-3L total hip surgery population and 4 by the patients’ self-assessed transition level. All the graphs are additionally presented by age group. Coloured dots indicate the 50th percentile for each category, and grey dots indicate actual observations. Grey horizontal lines indicate each defined score improvement (e.g. 22 for the OHS and 0.428 for the hip EQ-5D-3L). Percentiles of the EQ-5D-3L are widely dispersed at every transition level, whereas percentiles of the OHS are dispersed at the ‘A little worse’ and ‘Much worse’ transition levels. Model performance of the OKS and the knee EQ-5D-3L is provided in Supplementary Figure 1.

The Box-Cox regression models were selected from among other statistical average models (e.g. polynomial regressions) and median-based models (e.g. quantile regressions) after the model diagnostic assessments. The model is robust for a non-normal dependent variable, transforming it into a normal shape. The observed and estimated percentage improvements for the average population were calculated to examine whether the instrument has a good discriminative ability. The individual-level post-operative scores were modelled as a function of the transformed variables (pre-operative linear, quadratic, and cubic terms) and of the untransformed age, gender, and individual comorbidities. In comparison to the models with only pre-operative score terms, circulation and depression (whose chi-squared statistics are greater than 2000 in the models and whose coefficients are significantly large, i.e. greater than 200 in absolute value) were selected to be adjusted for in the hip outcomes. Circulation, diabetes, and depression were selected for the knee outcomes based on the same criteria.

The 3rd degree left-hand-side-only model obtaining the maximum likelihood estimates is as below for the OHS:

$$ {y}_i^{\theta }={\beta}_0+{\beta}_1{x}_i+{\beta}_2{x}_i^2+{\beta}_3{x}_i^3+{\gamma}_1{z}_{1i}+{\gamma}_2{z}_{2i}+{\gamma}_3{z}_{3i}+{\gamma}_4{z}_{4i}+{\varepsilon}_i $$
(5)

where ε ~ N(0, σ²). y indicates the changed-operative score, and x indicates the pre-operative score. y is subject to a Box-Cox transform with parameter θ. z1, z2, z3, and z4 are untransformed age, gender, circulation, and depression [26].
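A left-hand-side-only Box-Cox regression of the form in Eq. 5 can be sketched with a grid search over θ on the profile log-likelihood. This is a sketch under stated assumptions, not the software routine used in the study: it uses the conventional transform (y^θ − 1)/θ (log y at θ = 0), requires strictly positive y (so scores containing zero would need a shift chosen by the analyst), and the function names and the θ grid are hypothetical.

```python
import numpy as np

def boxcox_transform(y, theta):
    """Conventional Box-Cox transform; y must be strictly positive."""
    y = np.asarray(y, dtype=float)
    if abs(theta) < 1e-12:
        return np.log(y)
    return (y ** theta - 1.0) / theta

def profile_loglik(theta, y, X):
    """Profile log-likelihood (up to a constant) of theta for transformed-y OLS."""
    n = len(y)
    yt = boxcox_transform(y, theta)
    beta, *_ = np.linalg.lstsq(X, yt, rcond=None)   # OLS on the transformed outcome
    resid = yt - X @ beta
    sigma2 = resid @ resid / n                      # ML estimate of the error variance
    jacobian = (theta - 1.0) * np.sum(np.log(y))    # Jacobian of the transform
    return -0.5 * n * np.log(sigma2) + jacobian

def fit_lhs_boxcox(y, x, Z, thetas=np.linspace(-2, 2, 81)):
    """Grid-search theta, then fit cubic terms in x plus untransformed covariates Z."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    X = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3, np.asarray(Z, dtype=float)])
    ll = [profile_loglik(t, y, X) for t in thetas]
    best = float(thetas[int(np.argmax(ll))])
    beta, *_ = np.linalg.lstsq(X, boxcox_transform(y, best), rcond=None)
    return best, beta
```

Here `x` plays the role of the pre-operative score and the columns of `Z` the untransformed covariates; a finer grid or a numerical optimiser could replace the coarse grid if a more precise θ is needed.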

Results

Demographics

In total, 181,423 patients had hip replacement surgery; over half (N = 106,493; 59%) were female, with ages ranging from 13 to 100 years (SD 10.5; male, 15–99, SD 10.4) and a mean age of 68.6 years (male, 67.2 years). At baseline, 14% (N = 24,945) of patients reported no comorbidity, 38.2% (N = 69,249) reported one comorbidity, and 17.8% (N = 3234) reported more than three comorbidities. Circulation was reported by 5.4% (N = 9866), diabetes by 8.7% (N = 15,816), and depression by 7.3% (N = 13,252).

For the knee replacement population, over half (N = 107,127; 56%) were female, with ages ranging from 18 to 99 years (SD 9.1; male, 16–102, SD 8.6) and a mean age of 69.3 years (male, 69.3 years). At baseline, 9.3% (N = 17,712) of patients reported no comorbidity, 33.3% (N = 63,804) reported one comorbidity, and 23.6% (N = 45,200) reported more than three comorbidities. Circulation was reported by 7% (N = 13,438), diabetes by 12.4% (N = 23,696), and depression by 8.3% (N = 15,823) (Table 2).

Table 2 Baseline covariates

Transition level

For the hip replacement surgery population, 155,899 (85.9%) patients answered much better and 15,565 (8.6%) answered a little better. Smaller numbers of patients answered about the same (3891; 2.1%), a little worse (2382; 1.3%), and much worse (1633; 0.9%). For the knee replacement surgery population, 138,407 (72.3%) and 31,650 (16.5%) patients answered much better and a little better, respectively; 8985 (4.7%) answered about the same, 7029 (3.7%) answered a little worse, and 4610 (2.4%) answered much worse (Table 3; Supplementary Table 1).

Table 3 The transition question (change score)

Spearman’s rank correlation coefficients for the pre- and post-operative scores, r, are provided by transition level in Table 4. Large correlations between the pre- and post-operative scores are observed in patients with the transition level of about the same, a little worse, and much worse.

Table 4 Spearman’s rank correlation coefficients (95% CIs) for the change (pre- and post-operative) scores

Univariate responsiveness measures for the paired data

The OHS and the OKS showed great univariate responsiveness in total: the SRM [Eq. 1], SES [Eq. 2], and RI [Eq. 3] were 1.8, 2.8, and 0.6 (~ 0.7) in the OHS and 1.4, 2.5, and 0.7 in the OKS. In addition, the OHS and the OKS showed distinctive differences in the SRM [Eq. 1] by the 3-level transition, in particular, a little better vs. about the same vs. much worse: 1.5 (~ 1.6) vs. 0.8 (~ 0.9) vs. 0.3 (~ 0.4) in the OHS and 1.5 vs. 0.8 (~ 0.9) vs. 0.3 (~ 0.4) in the OKS. There was little difference among the 3-level transition for the SES: 1.7 vs. 1.3 (~ 1.4) vs. 1 (~ 1.1) in the OHS and 1.7 vs. 1.2 vs. 1 in the OKS (Tables 5 and 6).

Table 5 Hip – the SRM, SES, and RI (with 95% CIs) for the OHS and the EQ-5D-3L (by the transition)
Table 6 Knee – the SRM, SES, and RI (with 95% CIs) for the OKS and the EQ-5D-3L (by the transition)

The univariate responsiveness measures in total for the generic instrument EQ-5D-3L were 1.1, 1.6, and 0.3 (~ 0.4) for the hip and 0.8 (~ 0.9), 1.3, and 0.3 for the knee replacement. The SRMs [Eq. 1] by the 3-level transition were 0.8 vs. 0.5 vs. 0.1 (~ 0.2) for the hip and 0.7 (~ 0.8) vs. 0.4 (~ 0.5) vs. 0.1 (~ 0.2) for the knee replacement. The SES values were similar to each other among the 3-level transition: 1.4 vs. 1.3 vs. 1.1 for the hip and 1.2 vs. 1.3 vs. 1.2 for the knee replacement.

The RI [Eq. 3] was calculated in total only, as its calculation incorporates the 2-level transition (i.e. a little better vs. about the same). The RIs [Eq. 3] in total were 0.6 (~ 0.7) in the OHS and 0.7 in the OKS, which are moderate practical effects by Cohen’s thresholds (i.e. > 0.8 large, 0.5 to 0.8 moderate, and < 0.5 small) [21, 27]. The RIs [Eq. 3] in total for the EQ-5D-3L showed negligible practical effects, 0.3 (~ 0.4) for the hip and 0.3 for the knee replacement. The SRM [Eq. 1] and SES [Eq. 2] can be interpreted similarly. The SRM [Eq. 1] and SES [Eq. 2] of ‘A little better’ in the OHS were 1.6 and 1.7, respectively. Both can be interpreted as a crucial difference in the ‘successful’ percentage in each of the two groups (r) of 0.62 [28]. The SRM [Eq. 1] and SES [Eq. 2] of ‘A little better’ in the EQ-5D-3L were 0.8 and 1.4, respectively, which can be interpreted as moderate and crucial differences in the ‘successful’ percentage in each of the two groups (r) of 0.37 and 0.57 [28]. This implies that the SRM [Eq. 1] shows a good discriminative ability for the different severities in comparison to the SES [Eq. 2], and that the EQ-5D-3L is less responsive than the OHS.

The paired data-specific MCID as the threshold for improvement

The paired data-specific MCID [Eq. 4] was calculated, applying the SRM [Eq. 1] as a desired ES. Multivariate responsiveness was examined using the defined capacity of benefit score as improvement (i.e. 22 for the OHS, and 0.428 for the hip EQ-5D-3L; 16 for the OKS and 0.309 for the knee EQ-5D-3L; Footnote 2), adjusting for covariates. Various ways to assess the improvement for the independent data are presented in Supplementary Table 2. Those scores are smaller than the capacity of benefit scores for the paired data. The SRM-applied MCIDs for the independent data are 6 for the OHS, and 0.196 for the hip EQ-5D-3L, using Cohen’s medium (0.5) effect size. The MDCs (minimal detectable changes, defined as the minimal change that falls beyond the measurement error in the measurement score [29]) are 6 for the OHS and 0.234 for the hip EQ-5D-3L, with ICC 0.9. The anchor-based MCIDs are 9 for the OHS, and 0.101 for the hip EQ-5D-3L, using the short distance. The mean change scores using the anchor are 6 for the OHS, and 0.106 for the hip EQ-5D-3L. A greater capacity of benefit score is required for the paired data in comparison to the independent data, to detect how likely the surgery is to distinguish an actual effect from one of chance in the pre- and post-operative outcomes.

Multivariate responsiveness measures – observed and predicted improvement

The percentage improvements based on patients’ perceptions were high in the OHS and the OKS (Tables 7 and 8). The percentages of the observed (predicted) total improvement were 51 (54)% in the OHS and 73 (58)% in the OKS. In addition, the OHS and the OKS showed distinctive percentage differences by the 3-level transition, i.e. a little better vs. about the same vs. a little worse. As an example, the observed percentages of the 3-level transition were 10% vs. 4% vs. 1% in the OHS and 21% vs. 6% vs. 3% in the OKS. The percentages of the observed (predicted) total improvement in the generic instrument EQ-5D-3L were 44 (48)% for the hip and 42 (44)% for the knee replacement population. The observed (predicted) percentages of the 3-level transition in the EQ-5D-3L were 39 (41)% vs. 29 (11)% vs. 21 (4)% for the hip and 39 (45)% vs. 32 (36)% vs. 26 (14)% for the knee replacement population.

Table 7 Hip – patients’ perception of improvement (%) (using the paired data-specific MCID [Eq. 4])
Table 8 Knee – patients’ perception of improvement (%) (using the paired data-specific MCID [Eq. 4])

The observed (predicted) percentage improvements applying Cohen’s ES (0.5 and 0.8) are additionally provided in Supplementary Tables 3 and 4 for the independent data. The observed (predicted) percentages for the medium improvement were 93 (99)% in the OHS, and 85 (98)% in the OKS. The observed (predicted) percentage improvements in the EQ-5D-3L were 75 (74)% for the hip and 60 (58)% for the knee replacement population. The observed (predicted) percentages of the 3-level transition were 78 (90)% vs. 52 (57)% vs. 34 (19)% in the OHS, and 73 (85)% vs. 46 (42)% vs. 29 (8)% in the OKS. The observed (predicted) percentages of the 3-level transition in the EQ-5D-3L were 50 (52)% vs. 38 (50)% vs. 29 (41)% for the hip and 45 (48)% vs. 36 (47)% vs. 29 (42)% for the knee replacement population.

A large share of patients (86% for hip and 72% for knee) answered much better for success of the surgery (Table 3). In addition, the greater capacity of benefit score was applied for the calculation of the paired data-specific percentage improvement. Therefore, overall percentages (%) of patients’ perception of improvement are lower in comparison to the improvement for the independent data. There were more distinctive percentage differences by the transition level when the paired data-specific capacity of benefit score was applied for the calculation.

Model performance

The area under the ROC curve (AUC) with 95% binomial exact confidence intervals was calculated to examine discriminative ability with each MCID assuming as the true improvement status, using the patient rating instruments, i.e. OHS, OKS, and EQ-5D-3L (Tables 7 and 8) for the observational data.
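As a rough illustration of the AUC used here, the sketch below computes it as the probability that a randomly chosen improved patient has a higher score than a randomly chosen non-improved patient, counting ties as one half. The function name and the binary coding of improvement status are assumptions; the exact binomial confidence interval mentioned above is not reproduced.

```python
def auc(scores, improved):
    """Area under the ROC curve via pairwise comparison (ties count 0.5).

    scores: instrument scores (e.g. change scores) used to rank patients.
    improved: matching 0/1 flags for the status treated as true improvement.
    """
    pos = [s for s, flag in zip(scores, improved) if flag]
    neg = [s for s, flag in zip(scores, improved) if not flag]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise form is quadratic in the number of patients, so for a cohort of this size a rank-based implementation would be the practical choice.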

Internal validation

Internal validation was performed by examining the sensitivity of the results within the dataset to the data period: NHS PROMs linked to HES 2009–2011 vs. 2012–2015. There was no significant sensitivity to the period (Supplementary Figure 2).

Discussion

The paired data-specific sensitivity of the EQ-5D-3L, the OHS and the OKS was investigated to detect changes in the health state over time for the population who underwent hip or knee surgeries in the UK. To ensure accuracy of the health status and instrument evaluation in hip and/or knee replacement surgery, the paired data-specific SRM was examined for the univariate responsiveness. In addition, the SES and the RI were calculated using the patients' self-assessed transition. Multiple responsiveness metrics were applied, including a robust modelling approach that adjusted for significant baseline covariates to estimate percentage improvements. From the modelling approach, the paired data-specific observed (and predicted) percentages of improvement were distinctive by the transition level (Tables 7 and 8). The multivariate modelling method provided robust responsiveness statistics in terms of adjusting for the patient demographic information and comorbidities. Responsiveness from the models was interpretable on a percentage scale of improvement.

A greater capacity of benefit score is applied to the calculation of improvement for paired data. Therefore, overall percentages (%) of patients’ perception of improvement are relatively low. Missing cases of predicted improvement at certain transition levels are inevitable for the Oxford questionnaires, which have ceiling effects where a large part of the study population answered much better after the surgery.

Disease-specific and generic instruments are both available in the PROMs data in the UK, and they showed reasonable responsiveness as health-related instruments that measure functional state. A previous study using the NHS patient-reported outcome measures (PROMs) supports moderate correlations (0.3 to 0.6) between the EQ-5D-3L and other measures of patient-reported health changes, including the OHS and the OKS [30]. Nonetheless, there has been a lack of evidence to support the ability to discriminate. In terms of detecting clinically significant changes in arthroplasty surgery, although it has not been firmly established yet, a number of studies indicated that disease-specific instruments are more responsive than generic instruments [4, 31,32,33,34,35]. The present study showed that, although the responsiveness was greater and more distinctive in the disease-specific instruments, the responsiveness of the EQ-5D-3L for hip and knee surgery is reasonably good. The EQ-5D would be useful in terms of short completion time and good validity [3]. Nevertheless, it may not be sufficiently sensitive to be used solely in hip and/or knee replacement surgery, either to discriminate between cases of differing severities by a transition question or to detect the changes in severity or functional status over time [21].

The accurate identification and early-stage stratification of patients undergoing hip and/or knee replacement is one of the greatest unmet needs. A robust and precise measurement instrument will be effective in the management of arthroplasty surgery for particular groups of patients. There is evidence that the OHS and the OKS are able to contribute to the better management of arthroplasty surgery. In general, arthroplasty surgery is based on an individual level in terms of a patient’s expectations, symptoms, diagnoses, and degree of pain. Although the excellence of the Oxford questionnaires over other patient-reported questionnaires was examined, the Oxford questionnaires have a ceiling effect, and the threshold levels are always a trade-off between sensitivity and specificity. Moreover, the current version of the OHS or the OKS does not contain a psychological measurement such as depression or anxiety, which is also important in health outcomes. Further investigation is required into their potential roles in clinical or trial use, cost-effectiveness, and their effects on referral patterns.

Strengths and limitations

The strengths of this study include the use of large cohort data linked to HES on both hip and knee replacement surgeries, which provided enough power to support the research outcomes. Although the sample size is large enough to validate the improvement values using complete-case analysis, validation by an external data set was not conducted. The study design may be suboptimal compared to a well-blinded randomized clinical trial. Additional care may be required in the interpretation of patients’ socio-demographics, clinical/treatment and other unobserved covariates that may not be adjusted for.

A secondary transition was not used in the study. The NHS PROMs data contains only a one-point transition measurement (6 months post-operation), and a more objective point assessment may need to be considered [36]. The mean change score using a patient-reported transition (i.e. an anchor approach) has a limitation, in that the one-point transition measurement relies on a patient’s memory of global health status, and it could be a more subjective change measurement in contrast with each of the pre- and post-operative point assessments [36]. In addition, measurement error should be accounted for in repeatedly measured patient-reported outcomes. There are several ways to control these errors, such as use of the MDC approach (i.e. the threshold for improvement adjusted for measurement error) or applying advanced statistical inference approaches such as Bayesian models with computational methods. A further potential limitation is that it is not easy to precisely estimate a percentage improvement using model fitting with the EQ-5D-3L, due to the nature of its real-number scale (−0.59 to 1) and because the scale is very dispersed (Supplementary Figure 3).

Conclusions

The paired data-specific responsiveness was investigated in the population from the NHS PROMs data who underwent hip or knee surgery in the UK. The OHS and the OKS showed good discriminative abilities for clinically significant changes, and the EQ-5D-3L also showed comparatively moderate responsiveness. Using the paired data-specific capacity of benefit scores, the OHS and the OKS showed distinctive differences in clinically significant changes by the level of the transition, in particular for the 3-level transition, i.e. a little better, about the same, and a little worse. This is useful in clinical practice as a rationale for access to surgery at the individual-patient level. The study finding supports the idea of using a precise estimation of improvement and appropriate instruments in arthroplasty surgery. It seems that a generic measure would be beneficial to use along with the disease-specific instruments in terms of cross-validation unless an enhanced instrument has been developed, or a specific reason is required in the reporting system.