Introduction

A problem often encountered in economic evaluations of healthcare interventions is deriving quality-adjusted life-years (QALYs) from disease-specific instruments, as these health-related quality of life (HRQoL) measures do not have utility scoring systems. A recently proposed solution, which has been gaining increasing attention, is “mapping” between instruments, i.e., regressing preference-based utility indices against dimension or item scores of disease-specific measures [1]. Various pairs of instruments have been used, with EQ-5D being the target measure in more than half the mapping studies published or produced after 2000 [2]. Other target measures have been the HUI2/HUI3, the SF-6D and the 15D. Disease-specific measures that have been used cover areas such as inflammatory bowel disease [3], obesity [4], angina [5], dental conditions [6], intermittent claudication [7], cancer [8–10] and rheumatoid arthritis [11–13].

Most published mapping studies have employed cross-sectional designs, and prediction models were typically derived from baseline (i.e., pre-intervention) patient data. This usually implies impaired HRQoL of the patient sample, which could affect predicted utility estimates when the same model is used on future (i.e., post-intervention) data, as required for calculating QALY gains in cost-utility analyses. The issue of longitudinal predictive ability of mapping equations is under-addressed in the literature and deserves more attention from researchers. The objective of this study was to compare post-intervention EQ-5D utilities predicted from baseline Modified Health Assessment Questionnaire (MHAQ) data with EQ-5D utilities reported by a cohort of Rheumatoid Arthritis (RA) patients at different time points after initiation of therapy and in particular to examine the effect of disease severity on the observed differences. The findings can be expected to add to the international body of knowledge on the subject and, from a methodological point of view, might be transferable to similar data sets from other studies.

Rheumatoid arthritis

Rheumatoid arthritis can have a major impact on HRQoL as it is a chronic, disabling and painful condition, leading to substantial loss of functioning and mobility [14]. It affects approximately 1% of the world’s population, women three times more often than men, and onset usually occurs between the ages of 40–50 [15]. Non-pharmacological treatment includes physical therapy, orthoses, occupational and nutritional therapy. Painkillers and anti-inflammatory drugs, including steroids, are used to suppress the symptoms, while disease-modifying antirheumatic drugs (DMARDs) are often required to inhibit or halt long-term damage. Recently, a newer group of biologics has increased treatment options, but cost as well. Measurement of disability and HRQoL in RA can be performed with a number of generic and/or disease-specific instruments, and the choice of instrument should be based on the situation and the goals of the particular study [16].

Methods

Measuring instruments

The EuroQol EQ-5D consists of a descriptive health state classification system with five domains (mobility, self-care, usual activities, pain/discomfort and anxiety/depression) and three severity levels in each (no problems, some problems and extreme problems) [17]. Combining one level from each domain defines 243 different health states ranging from full to worst health. Direct valuations for 42 health states were elicited from 3,395 persons of the UK general public using the TTO (Time Trade-Off) method. Regression techniques were applied to interpolate values for the other health states, and the utilities range from −0.59 to 1.00 [18, 19]. The EQ-5D was found to be applicable and adaptable to the Greek environment [20], and its construct validity has been demonstrated using a large sample of the Greek general population [21].

The term “HAQ” refers to a family of instruments derived from the Stanford Health Assessment Questionnaire [22]. In its original form the HAQ consists of 20 questions in 8 different categories, with each containing 2–3 common theme questions: dressing, standing, eating, walking, going to the toilet, reach, grip and instrumental activities. A score is derived for each category based on the most abnormal activity (question) in that category, and the final score is the average score of the eight categories. The “modified HAQ” or MHAQ is a subset of eight items taken from the eight categories, and was conceived as a less complicated and shorter version of the HAQ [23]. It is simple to score, and has the same range and scoring as the HAQ: without difficulty (0), with some difficulty (1), with much difficulty (2) and unable (3). Thus, the range is 0–24, or 0–3 when the total score is divided by 8, in which case the total MHAQ score is the mean of the scores for each activity. The MHAQ can be easily administered in clinical practice, and its brevity makes it an attractive alternative in many rheumatic disorder studies. Compared to the HAQ, the MHAQ provides lower scores (especially for patients with high disease activity [24]) and correlates better with important clinical measures [25]. On the other hand, it is characterized by bunching of values at the lower end of the scale [26] and reduced sensitivity to change [27].
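The MHAQ scoring rule described above can be illustrated with a minimal sketch. The function name and the example responses are hypothetical and serve only to show the arithmetic (sum of eight items coded 0–3, optionally divided by 8).

```python
def mhaq_score(items):
    """Return (total, mean) MHAQ scores for eight item responses coded 0-3."""
    if len(items) != 8:
        raise ValueError("MHAQ expects exactly 8 item responses")
    if any(r not in (0, 1, 2, 3) for r in items):
        raise ValueError("each response must be 0, 1, 2 or 3")
    total = sum(items)           # summed score, range 0-24
    return total, total / 8      # mean score, range 0-3


# Hypothetical patient: some difficulty on three activities, much difficulty on one
total, mean = mhaq_score([1, 0, 1, 0, 0, 1, 2, 0])
print(total, mean)  # 5 0.625
```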

The DAS28 is a frequent outcome measure used in therapeutic trials, and is also used to guide treatment decisions and describe disease activity across populations [28]. The variables required for its calculation include the number of swollen and tender joints (using 28-joint counts), the erythrocyte sedimentation rate (ESR), and the patients’ general health (GH) or global disease activity measured on a visual analogue scale (VAS). The DAS28 is the basis for several other RA measurement tools, including the EULAR (European League Against Rheumatism) response criteria. The latter include not only changes in disease activity, but also current disease activity. To be classified as responders, patients should have a significant change in DAS and also low current disease activity. Three categories are defined: good, moderate and non-responders [29].
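For illustration, the sketch below combines the four components listed above using the widely published ESR-based DAS28 formula. The coefficients are not stated in this paper and are included here as an assumption. The activity categories use the cut-offs reported in the Results (low: DAS28 ≤ 3.2; moderate: 3.2 < DAS28 ≤ 5.1), with "high" assumed for values above 5.1. Function names and the example patient are hypothetical.

```python
import math


def das28_esr(tender28, swollen28, esr_mm_h, gh_vas_mm):
    """DAS28 from 28-joint counts, ESR (mm/h) and general health VAS (0-100 mm).

    Coefficients follow the commonly cited ESR-based formula (an assumption;
    they are not given in the source text).
    """
    if esr_mm_h <= 0:
        raise ValueError("ESR must be positive to take its logarithm")
    return (0.56 * math.sqrt(tender28)
            + 0.28 * math.sqrt(swollen28)
            + 0.70 * math.log(esr_mm_h)
            + 0.014 * gh_vas_mm)


def activity_category(das28):
    """Classify current disease activity using the 3.2 and 5.1 cut-offs."""
    if das28 <= 3.2:
        return "low"
    if das28 <= 5.1:
        return "moderate"
    return "high"  # assumed label for DAS28 > 5.1


# Hypothetical patient: 9 tender joints, 4 swollen joints, ESR 30 mm/h, GH 50 mm
score = das28_esr(9, 4, 30, 50)
print(round(score, 2), activity_category(score))  # 5.32 high
```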

Sample and data collection

The sample consisted of 143 patients from the General Hospital “Asklipieio” Voulas, with active RA and suffering from the disease for at least 2 years. This cohort had previously responded inadequately to therapy with a DMARD (first therapy line) or a combination of a DMARD and a TNF-a inhibitor (a tumor necrosis factor-alpha blocking agent aiming to decrease joint damage; second therapy line), and as a result were embarking on their second and third line of therapy, respectively. The former involves administration of a biologic TNF-a inhibitor, namely Infliximab, Etanercept or Adalimumab, and the latter Rituximab or Abatacept. All start and stop dates, doses, changes in doses and reasons for change were recorded. The baseline visit was at the moment of treatment initiation. Patient characteristics such as age, gender, disease duration, rheumatoid factor (RF, an autoantibody usually present in the serum of people with RA), previous DMARD and the presence of at least one erosion in the hands and/or feet were registered. A trained research nurse assessed the patients at baseline, and at 3, 6 and 12 months after therapy initiation. Assessments included disease activity (measured by the DAS28), functional ability (measured by the MHAQ) and HRQoL (measured by the EQ-5D). All data were collected between January 2008 and August 2009. The interview lasted approximately 10 min, and all patients agreed to participate. The hospital’s Review Board approved the ethics of the study, and all participants provided informed consent.

Analysis

Ordinary least squares (OLS) regression was used to model the EQ-5D using the eight MHAQ domains as categorical variables, as this has been shown to provide a useful model [11]. The model defined in the present study is additive, i.e., it includes no interaction terms between predictor variables. A backward elimination procedure was employed, as it identified more significant MHAQ predictors than the forward and stepwise selection procedures. A full OLS model (i.e., with all MHAQ items as predictors) was also run, and its performance was similar to that of the backward model, which was chosen for parsimony. Predictive ability and explanatory power of the model were assessed via root mean square error (RMSE) and adjusted R², respectively. To further assess prediction accuracy, the RMSE was standardized on the measuring range of the model as % RMSE, which was calculated by dividing RMSE by the range of achievable EQ-5D scores.
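A minimal sketch of this kind of analysis is given below, assuming NumPy and purely synthetic data. It is not the SPSS procedure used in the study. It dummy-codes eight items with level 0 as the reference, fits the additive model by least squares, and reports RMSE, adjusted R² and % RMSE over the range of utilities the fitted model can produce. The backward selection step is omitted, RMSE is computed as the square root of the mean squared residual (other denominators are possible), and all function names and data are hypothetical.

```python
import numpy as np


def dummy_code(items, levels=(1, 2, 3)):
    """Intercept plus one indicator per item and non-reference level (0 = reference)."""
    cols = [(items[:, j] == lvl).astype(float)
            for j in range(items.shape[1]) for lvl in levels]
    return np.column_stack([np.ones(len(items))] + cols)


def fit_ols(X, y):
    """Least-squares coefficient estimates."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def achievable_range(beta, n_items=8, n_levels=3):
    """Lowest and highest utility the additive model can predict."""
    effects = np.hstack([np.zeros((n_items, 1)),
                         beta[1:].reshape(n_items, n_levels)])
    low = beta[0] + effects.min(axis=1).sum()
    high = beta[0] + effects.max(axis=1).sum()
    return low, high


def fit_statistics(X, y, beta, score_range):
    """RMSE, adjusted R-squared and RMSE as a percentage of score_range."""
    resid = y - X @ beta
    n = len(y)
    p = np.linalg.matrix_rank(X) - 1          # predictors excluding the intercept
    rmse = np.sqrt(np.mean(resid ** 2))
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return {"rmse": rmse, "adj_r2": adj_r2, "pct_rmse": 100 * rmse / score_range}


# Synthetic illustration only: 120 respondents, 8 items coded 0-3
rng = np.random.default_rng(0)
items = rng.integers(0, 4, size=(120, 8))
utility = 0.95 - 0.05 * items.sum(axis=1) + rng.normal(0.0, 0.1, size=120)

X = dummy_code(items)
beta = fit_ols(X, utility)
low, high = achievable_range(beta)
stats = fit_statistics(X, utility, beta, score_range=high - low)
print(stats)
```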

The model was tested longitudinally on data collected at 3, 6 and 12 months post-therapy. Predicted and reported EQ-5D utilities were compared with the paired-samples t test, and differences of 0.03 or more were regarded as minimally important differences (MID), i.e., approximately equivalent to a previous estimate of the MID for the EQ-5D [30], although different MIDs have been estimated for the instrument [31]. Longitudinal application of the mapping model generated a sample of 360 prediction errors overall, i.e., three per participant. To control for each patient being included three times in this sample, the absolute prediction error was modeled using fixed effects in terms of covariates such as age, gender, time, disease duration, treatment, RF, DAS28 score, predicted and reported EQ-5D. To gain insight into the sign of the prediction error, i.e., whether the model over- or underestimated utilities across the range of health status of the sample, prediction errors (predicted minus reported utility) were plotted against reported EQ-5D, and the results are presented graphically in Fig. 1. The data were analyzed with SPSS version 15.0 and Microsoft Excel.
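The comparison of predicted and reported utilities can be sketched as follows, assuming NumPy. The function computes prediction errors (predicted minus reported), the paired-samples t statistic on those errors, and whether the absolute mean difference reaches the 0.03 MID. The p-value lookup is left out, and the function name and the four example values are hypothetical.

```python
import numpy as np

MID = 0.03  # minimally important difference used in the text


def compare_utilities(predicted, reported, mid=MID):
    """Paired comparison of predicted and reported utilities."""
    predicted = np.asarray(predicted, dtype=float)
    reported = np.asarray(reported, dtype=float)
    errors = predicted - reported                 # negative = underprediction
    n = errors.size
    mean_error = errors.mean()
    t_stat = mean_error / (errors.std(ddof=1) / np.sqrt(n))
    return {
        "mean_error": mean_error,
        "mean_abs_error": np.abs(errors).mean(),
        "t": t_stat,
        "df": n - 1,
        "exceeds_mid": bool(abs(mean_error) >= mid),
    }


# Hypothetical values for four patients at one follow-up visit
result = compare_utilities([0.60, 0.55, 0.70, 0.50], [0.70, 0.60, 0.72, 0.62])
print(result)
```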

Fig. 1 Comparison of reported EQ-5D utilities and prediction errors of the mapping model

Results

Table 1 shows the baseline characteristics of the patients. Disease duration was significantly longer in the third line therapy group (11.5 years vs. 16.2 years, P < 0.01). Gender distribution and age did not differ significantly across the two therapy lines, implying that the two groups were relatively homogeneous in terms of demographics. The distribution of patients with positive and negative RF also did not differ significantly between therapy lines, and the Rituximab group (3rd line therapy) was the only one in which a within-group significant difference (P < 0.01) was observed (specific results not shown). The two therapy groups differed in scores on all three outcome measures, as third line therapy patients demonstrated higher disease activity with the DAS28 (P = 0.001), worse functional status with the MHAQ (P = 0.008) and lower HRQoL with the EQ-5D (P = 0.009). Twenty-three patients, corresponding to 16.1% of the initial sample, dropped out of the study at some point because of complications and were thus excluded from all the following analyses, in which only patients providing complete longitudinal data were included (N = 120).

Table 1 Patient baseline characteristics by treatment scheme

All three outcome measures generated scores in the expected directions over the entire study period, i.e., biologic therapy decreased the disease activity and improved the functional status and HRQoL, implying the improvement of health over time in this cohort. These data are presented in Table 2. Specifically, disease activity decreased and, according to EULAR criteria, was characterized as moderate (3.2 < DAS28 ≤ 5.1) from baseline to 6 months and low (DAS28 ≤ 3.2) at 12 months. Baseline functional status followed a similar pattern and showed significant (P < 0.001) improvements over time (expressed by lower MHAQ scores). HRQoL, measured via the EQ-5D, also improved with time. These results also provide evidence in favor of the longitudinal validity of these instruments in RA patients, something that might have been expected for the disease-specific measures (DAS28 and MHAQ), but could only be a priori hypothesized for the generic instrument (EQ-5D).

Table 2 Outcome measures longitudinal descriptive statistics

With OLS regression and baseline patient data, a mapping model was derived, and the parameter estimates, predictive ability and goodness of fit are summarized in Table 3. Half of the MHAQ items (dressing, eating, reach and grip) were significant (P < 0.05) predictors of EQ-5D utilities. The explanatory power of the regression model was satisfactory (R² = 0.452), although high explanatory power does not necessarily imply good predictive ability as well. RMSE was calculated to be 0.207. The RMSE, which is usually preferred over R² in mapping studies [2], favors models that do not produce particularly large errors, and the smaller its value the better the performance [32]. The achievable range of EQ-5D utilities from this model was 1.28 (−0.348 to 0.936), which is close to the achievable range from the standard [18] scoring algorithm (−0.59 to 1.0). The standardized RMSE for the mapping model was 16.1%. To assess the longitudinal predictive ability of the mapping equation, the model was tested on the other three data sets, corresponding to different time points within the study period.
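The standardized RMSE reported above follows directly from the quoted figures, as the short check below shows. The second calculation, which divides by the range of the standard scoring algorithm instead, is not reported in the study and is added here only for comparison. Variable names are illustrative.

```python
rmse = 0.207
model_low, model_high = -0.348, 0.936      # achievable range of the mapping model
tariff_low, tariff_high = -0.59, 1.0       # range of the standard scoring algorithm

model_range = model_high - model_low       # 1.284, reported as 1.28
pct_rmse_model = 100 * rmse / model_range

# Comparison only (not reported in the study): RMSE relative to the tariff range
tariff_range = tariff_high - tariff_low    # 1.59
pct_rmse_tariff = 100 * rmse / tariff_range

print(round(model_range, 3), round(pct_rmse_model, 1))    # 1.284 16.1
print(round(tariff_range, 2), round(pct_rmse_tariff, 1))  # 1.59 13.0
```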

Table 3 OLS regression model derived from baseline patient data

According to the information in Table 4, the model appeared to underestimate future utilities, with an overall mean prediction error of about 6.5%. It is noteworthy that the mean absolute differences between reported and predicted EQ-5D utilities at 3, 6 and 12 months exceeded the typically reported MID of the instrument (0.03). However, by examining only mean prediction errors we cannot obtain evidence on the factors that affect the magnitude of these errors—since post-therapy samples include patients exhibiting a wide range of health states—or on the sign of the error, i.e., whether the model over- or underpredicts utilities.

Table 4 Predicted EQ-5D utilities

The results of the fixed-effects model (Table 5) identified factors with a significant impact on the size of the prediction error. Time (assessment at post-intervention compared to baseline), lower predicted EQ-5D and higher DAS28 scores, with the latter two corresponding to more severe health status, were significant predictors of a higher absolute prediction error. Age had a borderline significant negative effect (P = 0.052), whereas gender was not a significant factor. The disease-related variables, i.e., increased disease duration, third line therapy and positive RF, which imply increased disease severity, were positively associated with the prediction error (thus confirming the effect of the other covariates of health status), but did not have a significant effect on the size of the error.

Table 5 Fixed-effect model for the absolute EQ-5D prediction error

We plotted the prediction errors (i.e., predicted utility minus reported utility) against the range of health status of the sample, proxied via reported EQ-5D utilities, and the results are presented graphically in Fig. 1. The purpose of this analysis was to obtain a graphic representation of the relationship between prediction error and health status, and not to mathematically model such a relationship; hence, we pooled the observations from all three time points, i.e., 3, 6 and 12 months, for parsimony, instead of depicting three separate graphs. Prediction errors were increasingly negative for lower reported EQ-5D scores, implying that predicted utilities tended to be lower than reported ones in more severe health states. This observation confirms the results from the regression model.

Discussion

Mapping can be valuable for analyzing data from studies in which no preference-based instruments were used. It provides an opportunity to perform CUAs with disease-specific questionnaires, something that is usually not possible, and concurrently to exploit their increased sensitivity to disease aspects often overlooked by generic instruments. However, mapping should be the alternative strategy because generic and disease-specific instruments serve different purposes, and one cannot completely substitute for the other. There is always an inherent error in mapping [33, 34], which argues for the use of preference-based utility instruments in prospective economic evaluations of health care interventions. On the other hand, studies that included only non-preference-based measures of HRQoL may benefit from mapping for estimating utilities, and this is an important implication. Examining the longitudinal validity of mapping requires that a model derived from a pre-therapy patient group be able to validly predict future utilities, i.e., the health status of post-therapy patients, which is typically improved. This is the typical situation encountered in QALY calculations and apparently has not received its due attention in the literature.

In this study a mapping model was developed to predict EQ-5D utilities from baseline MHAQ data and was used on longitudinal RA patient data sets, collected at 3, 6 and 12 months after initiation of treatment with a biologic agent. The objective was to compare predicted and reported utilities at post-intervention periods and to specifically look at the impact of disease severity on these differences. The results showed an underestimation of utilities by the mapping model exceeding 0.03 on average in all the longitudinal samples, which is the suggested MID for the EQ-5D [31]. This implies that if the model were used in cost-utility analyses (CUAs) of RA interventions, which is the objective of mapping, these discrepancies could affect the estimation of QALYs and/or QALY gains. The implications of using predicted utilities in CUAs are however only theoretical and require empirical assessment, not addressed in this study, before conclusions can be drawn. It is worth mentioning that a recent study comparing QALY gains and incremental cost per QALY estimates from reported and predicted utilities showed that differences between the two indeed exist [35].

Our findings are consistent with the claim that utilities estimated by mapping algorithms typically have larger errors in more severe health states and smaller errors in better health states [2, 19]. In terms of the sign of the error, i.e., under- or overprediction of utilities, the literature suggests that mapped EQ-5D utilities show a systematic overprediction of utility values for patients in poor health [35, 36]. The results from the present study differ in this respect, since prediction errors were increasingly negative (i.e., predicted utilities tended to be lower than reported ones) in more severe health states. This could be a result of the dynamic design of the study. Fitting the mapping at baseline and then using the model at different (i.e., less) disease severity is, to some extent, extrapolation beyond the support of the data. There might also be an underlying issue of EQ-5D insensitivity to certain disease-specific changes, keeping in mind the recognized challenges with the distribution of its responses [11, 37, 38].

Mapping equations can only encompass health changes captured by the disease-specific instrument used [4], and this might provide additional insight into our findings. The MHAQ is a relatively brief and simple instrument to complete and score (compared, e.g., to the more complex HAQ); however, these advantages might make it less sensitive to functional loss and change [39]. Hence, the MHAQ is perhaps not capable of fully capturing the benefits of treatment, resulting in lower utilities, particularly in patients with better HRQoL. The strength of mapping usually depends on the overlap between the instruments in terms of their descriptive systems. Three of the EQ-5D dimensions, i.e., mobility, self-care and usual activities, fall within the domain of function, suggesting considerable overlap with the MHAQ. On the other hand, the other two EQ-5D dimensions, i.e., pain/discomfort and anxiety/depression, which are heavily weighted in the EQ-5D scoring algorithm [18], may not be picked up by the MHAQ, which does not encompass similar domains. It has been shown that very low utility scores and states “worse than death” derive from the contribution of pain and mood [37]. In a similar study, this was overcome by including a VAS pain scale and the SF-36 mood scale as continuous variables [13]. The validity and responsiveness of preference-based instruments, and particularly the EQ-5D, have been demonstrated in many RA studies, and a recent review has pointed out the need for more longitudinal comparisons [40]. The use of different utility instruments in patients with RA has led to different utility scores and, when applied to cost utility analyses, resulted in different QALY depending on which scale was used [41, 42].

In the majority of RA clinical trials the HAQ is the main, and often the only, measure of HRQoL, although it has been pointed out that it is sensible to include a disease-specific and a generic measure of health status in all future clinical trials [43]. Many studies have compared the performance of HAQ and utility instruments in RA samples, and in many cases have concluded that the latter adequately discriminate across levels of RA severity [44]. A wide variety of predictor variables have been mapped in different illnesses [2]. In rheumatic diseases in particular, the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) has been mapped to the EQ-5D [35, 45] and the HUI [32]. In RA, mapping efforts have used the HAQ to predict EQ-5D and SF-6D scores [11], and showed that use of selected domains from the HAQ best predicted EQ-5D scores, with RMSE and R² values of 0.183 and 0.57, respectively, which are comparable to the values from our study (0.207 and 0.45), keeping in mind that we used the less sensitive MHAQ. In another study mapping HAQ and combinations of covariates to the UK EQ-5D, predictive ability and fit ranged from 0.19 to 0.21 and 0.43–0.54, respectively [13]. Another recent mapping study in RA showed that a HAQ-mapped SF-6D score was valid and responsive in early RA [12]. We are unaware of any studies having mapped the MHAQ to a utility.

The policy implications of this study are important. If indeed the model is underestimating utilities at more severe health states, this could have consequences such as underestimation of QALYs and distorting results of economic evaluations. A solution might be to examine patient severity groups separately. According to a recent study looking extensively into the issue of mapping onto EQ-5D for patients in poor health, the large decrement of reporting “extreme problems” in the EQ-5D tariff, the few observations with the most severe level in any EQ-5D dimension and the many observations at the least severe level in any EQ-5D dimension led to a bimodal distribution of EQ-5D index values and resulted in noteworthy prediction errors. The authors concluded that “Specifying a separate mapping algorithm to predict utility values for individuals in poor health greatly reduces over prediction, but does not fully solve the problem” and that “Further research is needed to determine if specifying two functions and combining them is to be favoured over other approaches” [46].

The study sample consisted of RA patients having responded inadequately to some form of previous treatment, implying that more severe cases were generally included. Hence, the sample cannot be considered to be representative of RA patients, as newly diagnosed cases were not included. From a methodological viewpoint, a potential limitation may stem from the selection of linear regression to develop the mapping equation. The justification of this choice is that OLS is—by far—the most widely used method in mapping studies [2], and this provides the potential for cross-study comparisons. On the other hand, EQ-5D utilities, as well as all others, are bounded at both ends; therefore, it may be that other techniques, such as the Tobit regression or censored least absolute deviations (CLAD) estimator, could result in more accurate estimations. Furthermore, the present model excluded patient background characteristics such as age, gender and other disease-related aspects, as this would have increased model complexity. In any case, it has been reported that only “…modest or negligible improvements were achieved from increasing model complexity” [2].

In conclusion, this study has added to the ongoing discussion on the potential usefulness of mapping in CUAs, particularly with data sets from studies not including preference-based instruments. In the disease area of RA, this work can be seen as complementary to mapping efforts from other researchers. It is the first study to use the MHAQ as the start measure to predict a utility, as we are only aware of it being used, up to now, to predict HAQ scores [47]. Our specific context was patients embarking on a new therapy line after an inadequate response to previous therapy with a DMARD or a TNF-a inhibitor. This implies that further testing is required to determine if the mapping equation can be used on other RA patients in different situations. Finally, the finding that post-treatment utilities are underestimated, at least in more severe patients, requires further investigation before it is prudent to conduct cost-utility analyses in the context of RA by means of the MHAQ alone.