Background

Rheumatoid arthritis (RA) is a chronic and unpredictable disorder that can cause persistent joint pain, joint damage and long-term disability (especially in the hands and feet). The economic cost of RA is substantial for individual patients, health services and society as a whole [1]. Patients with poor and declining function from their diagnosis of RA generate elevated medical care costs [2]. A report by the National Rheumatoid Arthritis Society (NRAS) in 2010 found that the overall cost of RA to the UK economy was almost £8 billion per annum with National Health Service (NHS) expenditure totalling approximately £700 million per annum [3].

Patient reported outcome measures (PROMs) are increasingly used to measure health-related quality of life (HRQoL) from the patient perspective. PROMs have also increasingly been used in randomised controlled trials (RCTs) and other evaluative studies to measure the benefits of interventions in terms of health status or HRQoL. PROMs can be condition-specific or generic. Generic PROMs can either be preference-based (patient responses are generally used to generate profile scores, which are converted into index scores based on preferences for a given health state) or non-preference based (patient responses are generally summed to provide a score); examples of the former, such as the EQ-5D and SF-6D, offer several advantages [4, 5]. The main advantages of generic preference-based PROMs are their ease of administration and high rate of completion, the generalizability of their results and their ability to meet the requirements of decision-making bodies, such as the National Institute for Health and Care Excellence (NICE) in England and Wales, concerned with cost-effectiveness comparisons [6]. Furthermore, preference-based outcome measures can incorporate the impact of treatment or ill health within multidimensional scales and can be combined with data on survival in the form of quality-adjusted life years (QALYs) [4]. Given the increasing diffusion of PROMs within evaluative research and the increasing use of the outputs of preference-based outcome measures within decision-making processes, it is important to establish the relative merits of alternative PROMs in specific clinical and research contexts.

To be useful in assessing HRQoL in individuals with RA of the hand, PROMs should satisfy a range of psychometric properties. The psychometric literature has developed a number of criteria to judge the performance of different instruments, key to which are acceptability, validity, reliability, interpretability and responsiveness to changes in health state. Although the construct validity of the EQ-5D has been investigated in the context of RA [7] and the responsiveness of the EQ-5D and SF-6D in patients with early arthritis [8], only one previous study has compared the psychometric properties of generic HRQoL measures (HUI2, HUI3, SF-6D and EQ-5D) with a disease-specific instrument (Rheumatoid Arthritis Quality of Life Scale, RAQoL) in the RA context [9]. The Michigan Hand Outcome Questionnaire (MHQ) is a well-established measure for patients with RA and widely used in clinical trials [10]. It has previously been compared to the Health Assessment Questionnaire (HAQ) in patients with RA [11] and also compared to the SF-12 in a study including patients with thumb osteoarthritis [12]. However, no studies have, to our knowledge, so far investigated the performance of the MHQ in relation to preference-based measures.

The current study aims to fill this research gap by investigating the psychometric properties and performance of generic PROMs compared to the MHQ for patients with RA of the hand. It is anticipated that the results will provide evidence for the use of generic (EQ-5D, SF-12, SF-6D) and condition-specific PROMs in future research studies, including economic evaluations, related to RA.

Methods

SARAH trial

The SARAH trial was a pragmatic, multi-centre, randomised controlled trial conducted with 1 year follow-up. 488 participants with RA who had pain and dysfunction of the hands and/or wrists were randomised to either a tailored exercise programme in addition to usual care (n = 246) or to usual care alone (n = 242).

The primary method of data capture was a face-to-face research clinic appointment. Baseline and follow-up data at 4 and 12 months after randomisation were collected, including the MHQ score, EQ-5D utility, EQ-5D VAS, SF-12 and SF-6D at each of these time points. Further details about the SARAH trial, its sampling procedures, methodology, outcome measures and response rates are reported in full elsewhere [13]. Since we were primarily interested in the properties of the outcome measures used, rather than any evaluation of the interventions in the trial, all SARAH participants were included in the analyses reported here, regardless of trial allocation. The SARAH trial was approved by the Oxford C Multicentre Research Ethics Committee in June 2008.

Patient reported outcome measures

MHQ

The primary outcome measure for the SARAH study was the MHQ overall hand function score at 12 months. The MHQ is a common hand-specific outcome measurement tool for patients with chronic hand conditions [14]. The MHQ has been validated for use in a wide range of patient samples. More specifically, it has been used in carpal tunnel syndrome [15, 16], distal radius fracture [17], reconstruction [18, 19] and arthroplasty in RA [20, 21]. The MHQ is appropriate for use in RA populations due to the comprehensive information gathered on functional abilities as well as patient satisfaction, pain and hand appearance. It has been utilised to assess disability and it is often an outcome measure for clinical trials in RA. It measures patient perception of hand function, appearance, pain, and satisfaction. It is intended for people with hand or wrist conditions or injuries [14]. It can be used to measure a patient’s general hand function, or can be used to assess changes in hand function over time, e.g. pre- and post-operation. It consists of 37 items and 6 subscales: overall hand function, activities of daily living (ADL), pain, work performance, aesthetics, and patient satisfaction with hand function. Scores range from 0 to 100, with higher scores indicating better performance, except for the pain scale. For the pain scale, a higher score indicates more pain [14].

EQ-5D

The EQ-5D-3L [22] (hereafter EQ-5D for brevity) comprises two components that assess health status on the day of completion. The first component is a self-reported descriptive system with five health dimensions (mobility, self-care, pain/discomfort, usual activities, and anxiety/depression) each divided into three different levels, namely no problems, some or moderate problems and severe or extreme problems. Responses to the descriptive system are generally valued using the time trade-off method. For the purposes of this study, we applied the York A1 (Dolan) tariff set derived from a survey of the UK general population (n = 3337), which used the time trade-off valuation method to estimate utility scores for a subset of 45 EQ-5D health states, with the remainder of the EQ-5D health states subsequently valued through the estimation of a multivariate model [23]. Resulting utility scores range from -0.59 to 1.0, with 0 representing death and 1.0 representing full health, with some health states considered worse than death (<0). A further component of the EQ-5D consists of a visual analogue scale (VAS), which asks people to rate their current overall health on a scale from 0 (the worst health state they can imagine) to 100 (the best health state they can imagine).
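As an illustration of how a tariff converts a descriptive profile into a utility, the sketch below applies the decrements commonly published for the UK (York A1) tariff. The coefficient values, the function name `eq5d_utility` and the five-digit profile string (assumed here to be ordered mobility, self-care, usual activities, pain/discomfort, anxiety/depression) are illustrative assumptions and should be checked against the original valuation study [23] before any analytic use. This is not the scoring code used in the SARAH trial.

```python
# Decrements as commonly published for the UK (York A1 / Dolan) EQ-5D-3L tariff.
# ASSUMPTION: values quoted from memory of the published tariff; verify against [23].
CONSTANT = 0.081  # subtracted once if any dimension is above level 1
N3 = 0.269        # subtracted once if any dimension is at level 3

# ASSUMPTION: profile digits follow this dimension order.
DIMENSIONS = ("mobility", "self_care", "usual_activities",
              "pain_discomfort", "anxiety_depression")

DECREMENTS = {
    "mobility":           {1: 0.0, 2: 0.069, 3: 0.314},
    "self_care":          {1: 0.0, 2: 0.104, 3: 0.214},
    "usual_activities":   {1: 0.0, 2: 0.036, 3: 0.094},
    "pain_discomfort":    {1: 0.0, 2: 0.123, 3: 0.386},
    "anxiety_depression": {1: 0.0, 2: 0.071, 3: 0.236},
}


def eq5d_utility(profile: str) -> float:
    """Return the utility for a five-digit EQ-5D-3L profile such as '21232'."""
    levels = [int(ch) for ch in profile]
    if len(levels) != 5 or any(level not in (1, 2, 3) for level in levels):
        raise ValueError("profile must be five digits, each 1, 2 or 3")
    if all(level == 1 for level in levels):
        return 1.0  # full health carries no decrement
    utility = 1.0 - CONSTANT
    for dimension, level in zip(DIMENSIONS, levels):
        utility -= DECREMENTS[dimension][level]
    if 3 in levels:
        utility -= N3
    return round(utility, 3)
```

Under these assumed coefficients the worst state, `eq5d_utility("33333")`, evaluates to -0.594, which agrees with the lower bound of -0.59 quoted above.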

SF-12 and SF-6D

The SF-12 consists of 12 items that assess 8 dimensions of health: physical functioning, role-physical, bodily pain, general health, vitality, social functioning, role-emotional and mental health. The SF-12 was scored as described by Ware [24]. The SF-12 measures various aspects of physical and mental health from which physical and mental summary scores can be calculated. The Physical Component Summary Score (PCS) and Mental Health Component Score (MCS) are both standardised to have a mean of 50 and a standard deviation of 10 [13].

A derivative of the SF-12 is the SF-6D, which is a multi-attribute utility measure composed of 6 dimensions (physical functioning, role limitation, social functioning, pain, energy, mental health), each of which has between four and six levels. The SF-6D generates 18,000 possible health states. To estimate health utilities for the SF-6D, we applied an algorithm developed by Brazier and colleagues [5] who surveyed a representative sample of the UK general population using the standard gamble technique. Utility values for SF-6D health states can fall between 0.30 and 1.0, where 1.0 represents full health and 0 represents death.

Indicators of RA pain

In the SARAH trial, RA pain was measured using the Troublesomeness questionnaire (range 0-20, higher score indicates greater pain) [25] at baseline, and at 4 and 12 months post-randomisation.

Statistical analysis

We followed the definitions and recommendations from the COSMIN (Consensus-based Standards for the selection of health Measurement Instruments) checklist [26], alongside a previously published checklist of assessment criteria for PROMs [27], when analysing the psychometric properties of the MHQ, EQ-5D (utility; preference-based responses to the EQ-5D descriptive system), EQ-5D (VAS), SF-12 and SF-6D in the SARAH trial. Statistical analysis was conducted using Stata version 13.0 (Stata Corporation, Texas, USA) [28].

Acceptability

A PROM must be practical and acceptable to the population that will be completing the instrument and also represent the interests and perspectives of many different individuals associated with the PROM. The acceptability of the different study PROMs was measured using completion rates at baseline and each of the two follow up time points (4 and 12 months post randomisation) [29, 30].

Validity

The validation process for PROMs aims to establish whether a measure is useful in reaching the objective it has been developed for. The overall validity of an instrument is composed of a number of important components, such as content, construct and criterion validity. From a theoretical point of view, a perfect validation process would compare the outcomes of the examined instrument to an external “gold standard”. However, for a number of abstract constructs such as pain, happiness or HRQoL, an external gold standard does not exist. This has led to the development of indirect empirical tests of validity [31]. Although many different indirect empirical tests of validity have been proposed in the social science literature, we focussed on construct and convergent validity.

Construct validity concerns the degree to which the scores of an instrument are an adequate reflection of the dimensionality of the construct to be measured [26]. The construct validity of the instruments used in the SARAH trial was assessed by the “known groups” approach [32]. In known groups validity, we take pre-specified groups where we would expect there to be a difference in health status, and thus instrument scores. The different scores between groups for alternative measures can then be compared to see if there is a pattern in the sensitivity to these expected differences [33]. Independent samples t-tests were performed to estimate the ability of each summary score, i.e. MHQ, EQ-5D (utility), EQ-5D (VAS), SF-12 (PCS), SF-12 (MCS) and SF-6D utility, to discriminate between groups with different RA severity at baseline. RA pain severity was measured using the Troublesomeness questionnaire [25]. We classified individuals according to their pain troublesomeness using a 30% threshold (low: pain troublesomeness score <30%; high: pain troublesomeness ≥30%) [25]. To assess convergent (discriminant) validity we assessed the relationship between continuous clinical (MHQ) and health-related utility measures (EQ-5D utility, SF-6D utility) at baseline with the Spearman’s rank correlation coefficient. A correlation coefficient between 0.9 and 1.0 suggests that variables can be considered very highly correlated. Correlation coefficients between 0.7 and 0.9 indicate variables that can be considered highly correlated. Correlation coefficients between 0.5 and 0.7 suggest that variables can be considered moderately correlated, whilst correlation coefficients between 0.3 and 0.5 indicate variables that have a low correlation [34].
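The two validity checks described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the Stata code used for the study. The function names are hypothetical, the troublesomeness score is assumed to be already expressed as a percentage of its maximum, the between-group effect size is assumed to be Cohen's d with a pooled standard deviation, and the "negligible" label for correlations below 0.3 is an assumption, because the text defines no band below that value.

```python
from math import sqrt
from statistics import mean, stdev


def split_known_groups(scores, troublesomeness_pct, threshold=30.0):
    """Split instrument scores into low (<threshold) and high (>=threshold) pain groups."""
    low = [s for s, p in zip(scores, troublesomeness_pct) if p < threshold]
    high = [s for s, p in zip(scores, troublesomeness_pct) if p >= threshold]
    return low, high


def pooled_sd(a, b):
    """Pooled standard deviation of two independent samples."""
    na, nb = len(a), len(b)
    return sqrt(((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2))


def t_statistic(a, b):
    """Independent-samples (equal-variance) t statistic."""
    return (mean(a) - mean(b)) / (pooled_sd(a, b) * sqrt(1 / len(a) + 1 / len(b)))


def cohens_d(a, b):
    """Standardised mean difference between groups (ASSUMED effect-size definition)."""
    return (mean(a) - mean(b)) / pooled_sd(a, b)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    return cov / sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))


def correlation_band(rho):
    """Label the strength of a correlation using the cut-offs quoted in the text."""
    r = abs(rho)
    if r >= 0.9:
        return "very high"
    if r >= 0.7:
        return "high"
    if r >= 0.5:
        return "moderate"
    if r >= 0.3:
        return "low"
    return "negligible"  # ASSUMPTION: band not defined in the text
```

In use, `split_known_groups` would be called once per instrument with the baseline scores, and `correlation_band(spearman_rho(mhq, eq5d))` would label one cell of a pairwise correlation matrix, with missing pairs removed beforehand.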

Reliability

In the current study, we evaluated one type of reliability, internal consistency, for the MHQ, EQ-5D (utility), EQ-5D (VAS), SF-12 (PCS), SF-12 (MCS) and SF-6D at baseline. Internal consistency reliability measures the homogeneity of the items comprising a scale; that is, whether the items in the same scale measure the same underlying concept. We used Cronbach’s alpha (α) coefficient to express internal consistency. Cronbach’s alpha can range from 0 to 1.0, where 1.0 indicates perfect internal consistency. Generally, consistency is considered unacceptable for α <0.5, poor for 0.6 > α ≥ 0.5, questionable for 0.7 > α ≥ 0.6, acceptable for 0.8 > α ≥ 0.7 and good for 0.9 > α ≥ 0.8 [32]. Values > 0.90 indicate redundancy [34].
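For readers unfamiliar with the coefficient, a minimal sketch of Cronbach’s alpha and the interpretation bands quoted above follows. It uses the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of total score). The function names and the input layout (one list of respondent scores per item) are hypothetical, and a value of exactly 0.9, which the quoted bands leave undefined, is assumed to count as "good".

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a scale.

    items: list of k lists, one per item, each holding the scores of the
    same n respondents in the same order (complete cases only).
    """
    k = len(items)
    if k < 2:
        raise ValueError("at least two items are required")
    n = len(items[0])
    if any(len(item) != n for item in items):
        raise ValueError("all items must have the same number of respondents")
    totals = [sum(responses) for responses in zip(*items)]  # total score per respondent
    sum_item_variances = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - sum_item_variances / variance(totals))


def alpha_band(alpha):
    """Label internal consistency using the cut-offs quoted in the text."""
    if alpha > 0.9:
        return "possible redundancy"
    if alpha >= 0.8:
        return "good"  # ASSUMPTION: exactly 0.9 is treated as good
    if alpha >= 0.7:
        return "acceptable"
    if alpha >= 0.6:
        return "questionable"
    if alpha >= 0.5:
        return "poor"
    return "unacceptable"
```

For example, three items scored `[1, 2, 3, 4, 5]`, `[2, 1, 4, 3, 5]` and `[1, 3, 2, 5, 4]` by five respondents give an alpha of approximately 0.84, which falls in the "good" band.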

Interpretability

Interpretability is defined as the degree to which one can assign qualitative meaning to an instrument’s quantitative scores or change in scores [26]. Although not generally considered a psychometric property, it is an important characteristic of a measurement instrument. The interpretability of the different study PROMs was measured using the minimal important difference (MID), which reflects the smallest amount of change in a score that is meaningful to a patient [36].

Responsiveness

Responsiveness considers whether the changes registered by a measure over time correspond to those expected based on an external reference measure of health [35]. We made use of two different reference measures to estimate the responsiveness of all study PROMs. The first referent was participant self-rated improvement in their hands and wrists, which used a seven-point Likert scale asking whether they had completely recovered, were much improved, slightly improved, showed no change, were slightly worse, much worse or vastly worse. These were collapsed into three categories, namely improved, no change and worsened for the purposes of these analyses. The second referent was a self-rated measure of benefit and satisfaction from trial treatments that assessed whether participants experienced substantial benefit, moderate benefit, no benefit, moderate harm or substantial harm. These were collapsed into three categories, namely benefit, no benefit and harm for the purpose of these analyses. The estimate of responsiveness was measured from baseline to 4 months and 4 months to 12 months for the self-reported hand and wrist functioning measure, and from baseline to 4 months and baseline to 12 months for the measure of benefit and satisfaction from trial treatments. Two statistics were employed for this purpose: the Effect Size (ES) and the Standardized Response Mean (SRM). The ES can be defined as the change in mean score divided by the standard deviation of the instrument scores at baseline. The SRM divides the mean change in score by the standard deviation of individuals’ change in score. Changes in both were considered large when the ES and SRM were greater than 0.8, moderate when they were between 0.79 and 0.5 and small when they were between 0.49 and 0.2 [26].
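The two responsiveness statistics defined above differ only in their denominator, which the following sketch makes explicit. It assumes paired, complete observations for one referent category (for example, all participants who rated themselves "improved"). The function names are hypothetical, values of exactly 0.8 are assumed to count as large, and the "negligible" label below 0.2 is an assumption not taken from the text.

```python
from statistics import mean, stdev


def effect_size(baseline, follow_up):
    """ES: mean change divided by the standard deviation of baseline scores."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


def standardised_response_mean(baseline, follow_up):
    """SRM: mean change divided by the standard deviation of the change scores."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)


def responsiveness_band(value):
    """Label the magnitude of an ES or SRM using the cut-offs quoted in the text."""
    magnitude = abs(value)  # sign only indicates improvement versus worsening
    if magnitude >= 0.8:    # ASSUMPTION: exactly 0.8 is treated as large
        return "large"
    if magnitude >= 0.5:
        return "moderate"
    if magnitude >= 0.2:
        return "small"
    return "negligible"     # ASSUMPTION: band not defined in the text
```

Because the SRM is scaled by the variability of change rather than of baseline scores, the two statistics can rank instruments differently when individual changes are very homogeneous or very heterogeneous, which is one reason to report both.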

Results

A total of 488 participants were recruited into the SARAH trial, 452 (92%) and 438 (89%) of whom were followed up at 4 and 12 months, respectively. At inclusion in the study there were 76% females and the mean age was 62.4 years (Table 1).

Table 1 Baseline characteristics and response rates to each outcome measure at each time point (n = 488)

Acceptability

Response rates for each outcome measure at baseline, and 4 and 12 months follow up are reported in Table 1. Response rates across all study time points ranged from 76.0% (SF-12) to 99.1% (EQ-5D utility). The SF-6D response rate (80.1%) was slightly higher than for both SF-12 subscales (76.0%). At baseline there were no missing data for the MHQ and the EQ-5D (VAS).

Validity

Table 2 shows the results of the known-groups validity tests at baseline. Although all differences between low and high pain troublesomeness scale scores were statistically significant at the 5% significance level, not all instruments discriminated well between patients who had RA pain versus those who did not (as depicted by the pain troublesomeness scale). The MHQ and SF-12 (PCS) had large effect sizes (>0.8), while the remainder of the instruments had medium effect sizes.

Table 2 Known-groups (construct) validity effect sizes for pain troublesomeness (baseline data)

Table 3 presents the Spearman’s rho correlation coefficients between the various instruments, all of which were statistically significant at the 1% level of significance. Our results suggest that the MHQ correlated moderately with the SF-6D (ρ = 0.63) and EQ-5D (ρ = 0.65).

Table 3 Convergent (discriminant) validity. Multitrait-multimethod (MTMM) correlation matrix illustrating the correlation of the different measures at baseline, missing data excluded pairwise for each comparison (n = 488)

Reliability

The internal consistency of the study outcome measures as estimated by the Cronbach’s alpha coefficient at baseline (Table 4) was similar across all scales and above the threshold of 0.70 recommended for broader use in clinical research [10].

Table 4 Average inter-item correlation and Cronbach’s Alpha scores for study outcome measures at baseline

Responsiveness and Interpretability

Mean scores for each PROM at baseline and 4 months follow-up (Table 5) and at 4 and 12 month follow-up (Appendix 1) are shown for the self-reported hand and wrist functioning measure, which was used to estimate responsiveness; changes over time and ES and SRM estimates are also presented.

Table 5 Responsiveness of measures over time to self-reported hand and wrist functioning; baseline to 4 months

There was a statistically significant change in MHQ score for patients reporting improved hand and wrist functioning (Δ = 13.13) between baseline and 4 months. This was also the case for the EQ-5D (utility) (Δ = 0.11) and EQ-5D (VAS) (Δ = 7.5) (Table 5). Minimally important differences (MID) [interpretability] for each PROM varied; whilst a meaningful alteration in MHQ score required a large change over the study period, all other measures required smaller numerical changes. Table 5 summarises the ESs and SRMs for all measures and shows that the MHQ score (ES = 0.79, 95% CI: -1.64 to 3.32; SRM = 0.56, 95% CI: -1.88 to 3.00) was highly responsive in capturing improvements in self-reported hand and wrist function between baseline and 4 months. ESs and SRMs for EQ-5D (utility) and EQ-5D (VAS) were larger for the “improved” changes compared to the other categories. Overall, there were no consistent patterns in detecting changes to hand and wrist functioning between baseline and 4 months. The same analysis was repeated for changes between 4 and 12 months (Appendix 1). Results suggest more consistent patterns between all instruments, with ESs and SRMs indicating less than moderate responsiveness in capturing improvement and worsening from 4 to 12 months. Estimates ranged from 0.35 (MHQ) to -0.34 (EQ-5D VAS).

The results for the analyses that used the perceived benefit/harm measure are summarised in Appendices 2 and 3 for the alternative follow-up periods and suggest that all instruments show small responsiveness (ES and SRM < 0.5) to perceived benefit/harm from the treatments between baseline and 4 months. The MHQ score was highly responsive to assessments of benefit or harm over the 12 month follow-up period (ES > 0.8).

Discussion

This study compared the psychometric properties of generic HRQoL measures [EQ-5D (utility), EQ-5D (VAS), SF-12 (PCS), SF-12 (MCS), SF-6D (utility)] and a condition-specific (MHQ) PROM in a large sample of participants with RA of the hand. We examined the acceptability, construct validity, convergent validity, internal consistency, interpretability and responsiveness of these measures, as defined by the COSMIN checklist [26] and the checklist of assessment criteria published by Brazier and colleagues [27]. The reliability and validity of the MHQ have previously been established [13]. This study is the first to estimate the validity of the MHQ against an objective measure of pain troublesomeness. It further compared the MHQ with generic HRQoL instruments (EQ-5D, SF-12, SF-6D) to understand the strengths and weaknesses of each of these instruments in studies of RA of the hand.

High response rates to all measures included in the study, particularly for the EQ-5D, indicate high acceptance of these instruments by the individuals who completed the questionnaires, and their suitability for self-administration. The high response rates over the course of the SARAH study were achieved through completing measures face to face at a research clinic and also through follow-up mechanisms that included reminders sent to study participants by post or by telephone. The order in which measures were presented in the self-completion patient questionnaires (MHQ, EQ-5D, SF-12) might have influenced the response rates. Our findings from analyses of construct validity generally support the ability of all the measures used to discriminate between different levels of RA severity of the hands. Mean HRQoL or utility scores for all measures were significantly different between participants experiencing differing severity of RA pain. This finding is not in line with the study by Marra et al. [9], which found that of the Health Utilities Index 2 and 3 (HUI2, HUI3), EQ-5D, SF-6D, RA Quality of Life Questionnaire (RAQoL) and the Health Assessment Questionnaire (HAQ), only EQ-5D and SF-6D scores significantly differed by level of RA severity. Strong associations were observed in our study between the MHQ score and RA pain severity, followed in strength of association with RA pain severity by the EQ-5D (utility) and the SF-6D. In addition, our results suggest that the MHQ was highly responsive to assessments of benefit or harm over the 12 month follow-up period within the SARAH trial. Adams et al. [36] previously concluded that the EQ-5D is more responsive to deterioration in RA pain than the SF-6D and the SF-6D is more responsive to RA improvement than the EQ-5D.
The physical component of the SF-12 had more consistent construct validity in our study than the mental health component of the measure, which is in agreement with the findings of Kosinski et al. [37] in their validation study of the SF-36. Our findings with regard to the ability of the physical and mental health components of the SF-12 to discriminate between RA pain severity are also in agreement with the study by Linde et al. [38].

Our convergent validity analysis indicated that the MHQ score correlates most strongly with the EQ-5D (utility) score. The low level of correlation found between the SF-12 (MCS) and MHQ, and between the remaining PROMs, indicates that their respective constructs may be non-overlapping. The physical component of the SF-12 demonstrated the highest degree of inter-relatedness, especially with the EQ-5D. All measures under investigation displayed acceptable internal consistency as measured by Cronbach’s alpha values (>0.70).

The large ESs and SRMs for the MHQ indicate that this measure is very responsive in detecting changes in self-reported hand and wrist functioning; its responsiveness was followed by that for the EQ-5D (utility), SF-12 (PCS) and SF-6D. The SF-12 (MCS) could only moderately detect such changes. Overall, the measures, particularly the MHQ and EQ-5D, were more responsive in detecting improvement in external measures of health than worsening or no change. Our condition-specific instrument (MHQ) performed better at detecting patient-reported changes in external measures of health compared to the generic measures. This finding contradicts Linde and colleagues [38] who found no superiority in responsiveness of RA clinical measures (Rheumatoid Arthritis Quality of Life Scale (RAQoL) and Health Assessment Questionnaire (HAQ)) compared to the EQ-5D.

A possible weakness of this study is that due to data limitations, we were unable to assess the criterion validity, content validity and test-retest reliability of the measures. Also, some of the limitations of the analytical strategy are related to the use of total scores for many of the PROMs instead of using weighted methods (through, for example, factor analysis or Item Response Theory) [39].

Despite the study limitations, it should help to inform clinical researchers and health economists in this field in their selection of PROMs for use in their clinical and health economic evaluations. More specifically, the precision of trials in the context of RA where health outcomes are measured through a single instrument will be enhanced by evidence surrounding the psychometric properties of the alternative outcome measures evaluated in our study.

Conclusions

In conclusion, the instruments evaluated in this study displayed varying psychometric properties in the context of RA of the hand. Our results extend beyond those of Harrison et al. [40], who previously proposed that at least one measure of HRQoL be included in studies of inflammatory arthritis. Our study revealed that, of the study measures, the MHQ was most responsive in detecting change in indicators of RA pain severity, whilst the EQ-5D offered advantages over the SF-12 and its preference-based derivative (SF-6D) with respect to some psychometric properties. However, the selection of a preferred instrument in evaluative studies should ultimately depend on the relative importance placed on individual psychometric properties and the importance placed on generation of health utilities for economic evaluation purposes. Future studies are also needed to establish the generalizability of our findings for different hand conditions and different hand practices.