Background

Over the last two decades, various disease-modifying and symptomatic treatments have been developed for people with Multiple Sclerosis (MS). Meanwhile, increasing emphasis has been placed on achieving “value for money” within healthcare systems [1]. Clinical trials of interventions that target particular symptoms frequently use symptom-specific outcome measures in order to maximise sensitivity and responsiveness to change. Fatigue is the most common symptom experienced by people with MS, and has a considerable impact on quality of life [2]. The Fatigue Severity Scale (FSS) [3] is frequently used in clinical trials of interventions for fatigue in people with MS, including carnitine, amantadine, aspirin, modafinil and cognitive behavioural therapy [4,5,6,7]. Symptom-specific outcome measures, such as the FSS, provide a standardised means of describing “health states” that may be experienced by patients, but do not provide data in the format required by many decision-making bodies to assess cost-effectiveness [1].

The quality-adjusted life-year (QALY) is recommended for use as an outcome measure for cost-effectiveness analyses by several national decision-making bodies, eg the National Institute for Health and Care Excellence (NICE) [8,9,10]. QALYs combine quantity and quality of life in a single measure, by adjusting the number of life-years lived according to the quality-of-life experienced during those years [1]. In order to estimate QALYs, numerical values must be assigned to reflect the quality of life experienced when living in particular health states. These values are commonly obtained using preference-based measures (PBMs) of health-related quality of life [11].
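The QALY calculation described above can be sketched in a few lines. This is a minimal illustration only: the function name, the utilities and the durations are hypothetical and are not taken from this study, and discounting is ignored.

```python
def qalys(periods):
    """Sum utility-weighted time over (utility, years) pairs.

    Each pair holds a health state utility value and the number of
    years spent in that health state.
    """
    return sum(utility * years for utility, years in periods)


# Hypothetical profile: 2 years at utility 0.70, then 3 years at utility 0.50
example = qalys([(0.70, 2.0), (0.50, 3.0)])
```

In this hypothetical profile, five life-years are adjusted down to fewer QALYs because both periods are lived at a utility below 1.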

However, many clinical trials do not include a PBM, limiting the ability to conduct economic evaluations. In such cases, statistical procedures may be used to “map” scores on non-preference based outcome measures, such as the FSS, to health state utility values (HSUVs) derived from PBMs. “Mapping” involves regression analysis, using a dataset containing responses to both measures from the same sample, to derive an algorithm that can be used to convert data from non-preference-based measures into HSUVs. Over recent years, the use of mapping has increased considerably [11]. Previous studies have reported on mapping from MS-specific outcome measures including the Multiple Sclerosis Impact Scale and the Multiple Sclerosis Walking Scale-12 [12,13,14]. However, no approach has been reported that uses fatigue measures to map to HSUVs in the context of MS.

Methods

This paper uses statistical techniques to map from the FSS (the “source measure”) to HSUVs derived from three preference-based measures: the EQ-5D, SF-6D and MSIS-8D (the “target measures”). The aim is to derive algorithms to convert FSS scores into HSUVs for use in assessing the cost-effectiveness of treatments for fatigue in people with MS. The statistical approach presented here is based on good practice methodology, and is consistent with the recommendations regarding mapping methods from NICE in the UK [15] and the international ISPOR Good Practices for Outcomes Research Task Force [16].

Measures

The Fatigue Severity Scale (FSS) has acceptable reliability, internal consistency, sensitivity and responsiveness for people with MS [3, 17,18,19,20,21]. It comprises nine statements, describing the severity and impact of fatigue, with a scale of possible responses ranging from 1 (“strongly disagree”) to 7 (“strongly agree”). FSS total scores are usually reported as the mean score over the nine items; a higher score indicates greater severity.
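As a rough sketch of the scoring just described, the function below returns both the summed score and the mean score for one respondent. The function name and the example responses are hypothetical; the rule of rejecting incomplete responses is an assumption made here for simplicity.

```python
def fss_scores(items):
    """Return (summed score, mean score) for nine FSS item responses.

    Each response must be an integer from 1 ("strongly disagree")
    to 7 ("strongly agree").
    """
    if len(items) != 9:
        raise ValueError("the FSS has nine items")
    if any(not 1 <= response <= 7 for response in items):
        raise ValueError("responses must lie between 1 and 7")
    total = sum(items)
    return total, total / 9


# Hypothetical respondent
summed, mean_score = fss_scores([4, 5, 3, 6, 4, 5, 4, 7, 5])
```

The summed score ranges from 9 to 63 and the mean score from 1 to 7; a higher value on either indicates greater severity.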

The EuroQoL EQ-5D-3L has five dimensions (mobility, self-care, usual activities, pain/discomfort, anxiety/depression) with three response levels per dimension: no problems, some problems or extreme problems/confined to bed. HSUVs were derived from the preferences of a representative sample of the UK general population, using a variant of the time trade-off (TTO) technique, and range from −0.594 to 1.000 [22]. The EQ-5D is widely used in economic evaluations, particularly in the UK, where NICE recommend it as the preferred measure of health outcomes for cost-effectiveness analyses [8].

The Short-Form 6D (SF-6D) enables HSUVs to be estimated from a popular non-preference based measure of health-related quality of life (HRQoL), the Short-Form 36 (SF-36). It consists of six dimensions (physical functioning, role limitations, social functioning, pain, mental health, vitality) with between four and six response levels. Preferences were elicited from a representative sample of the UK general population using the standard gamble technique and values range from 0.301 to 1.000 [23]. The dataset used for analysis includes responses to Version 1 of the SF-36 from earlier waves of data collection, before this was replaced by SF-36 Version 2, which was developed to address concerns about the structure and wording of some items [24]. Given that the component items of the SF-6D classification system differ between the two versions, we only included responses to Version 2 of the SF-36 in this analysis, in order to ensure consistency.

The Multiple Sclerosis Impact Scale 8D (MSIS-8D) enables HSUVs to be estimated from responses to an MS-specific outcome measure, the Multiple Sclerosis Impact Scale (MSIS-29). It includes eight dimensions (physical function, social and leisure activities, mobility, daily activities, mental fatigue, emotional well-being, cognition, depression) with four response levels each [25]. HSUVs were derived from a TTO survey with a sample of the UK general population. Values range from 0.079 to 0.882. It was not assumed that the best health state described by the MSIS-8D classification system (ie “no problems” on all dimensions) was equivalent to perfect health; therefore, the value of this health state was not constrained to 1 [26]. The MSIS-8D was derived from Version 2 of the MSIS-29 [21], which has four response levels per item, rather than Version 1 of the MSIS-29, which has five response levels [27]. Therefore, although earlier waves of data collection used Version 1 of the MSIS-29, only responses to Version 2 were included in this analysis.

Dataset

The South West Impact of Multiple Sclerosis (SWIMS) project is a longitudinal cohort study of people with MS aged 18 or over, living in Devon and Cornwall. Respondents complete six-monthly questionnaires, including several patient-reported outcome measures alongside clinical and demographic characteristics. The study was approved in the UK by the Cornwall and Plymouth and South Devon Research Ethics Committees, and written informed consent is obtained from all participants.

This analysis used SWIMS data received between August 2004 and October 2012. Only data collected at baseline were included, as this is the only point at which the FSS, EQ-5D, SF-36 and MSIS-29 are completed simultaneously. A random sample of 75% of the baseline data was used as the estimation dataset (n = 1056), with the remaining 25% constituting the validation dataset (n = 352) [11, 28]. As Table 1 shows, there were no statistically significant differences at the 5% level between the datasets in terms of mean FSS total scores, mean HSUVs, or recorded demographic or clinical characteristics. The mapping algorithms were derived using data from respondents who provided answers to all questions required to produce both an FSS total score and an HSUV from the target PBM: 1023 respondents for the EQ-5D, 607 for the SF-6D and 650 for the MSIS-8D (response numbers are lower for the SF-6D and the MSIS-8D as only Version 2 of the underlying questionnaires was included). All statistical analysis was undertaken in Stata 14.
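A 75%/25% random split of this kind can be sketched as follows. This is an illustration rather than the procedure used in the study, which was carried out in Stata: the function name, the seed and the use of integer identifiers are all assumptions.

```python
import random


def split_estimation_validation(ids, estimation_share=0.75, seed=2012):
    """Randomly split respondent identifiers into two disjoint sets.

    The seed is an arbitrary, hypothetical value used only to make
    the split reproducible.
    """
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * estimation_share)
    return shuffled[:cut], shuffled[cut:]


# 1056 + 352 = 1408 baseline respondents, as reported in the text
estimation, validation = split_estimation_validation(range(1408))
```

Fixing the seed means the same respondents fall into the estimation and validation datasets each time the analysis is rerun.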

Table 1 Summary of respondent characteristics, comparison of estimation and validation datasets

Preliminary assessment of measures

Two key conditions must be met for mapping: there should be conceptual overlap between the source and target measures, and the target measure should demonstrate discriminative validity with respect to the severity of the condition captured by the source measure [11, 29]. To assess conceptual overlap, the FSS items and the dimensions of the PBMs were allocated to a multi-dimensional conceptual framework, which was developed for this study in order to provide a structure for comparing the content of the measures. The measurement concept underpinning the three PBMs is HRQoL [22, 23, 25]. Therefore, the conceptual framework was structured around the commonly agreed key dimensions of HRQoL, which comprise physical and mental domains alongside a third domain relating to social and role function and participation [30,31,32]. The framework was constructed based on a systematic literature review of qualitative research into the impact of fatigue on people with MS (details of this review are included as Additional file 1: A).

Pearson correlation coefficients were assessed between the total FSS score and HSUVs from each of the PBMs, while Spearman correlation coefficients were assessed between FSS total scores and individual dimension scores for each PBM, and between HSUVs and individual FSS item scores. Assuming that these instruments measure distinct but related concepts, we expected to find relationships of moderate strength, ie correlation coefficients between 0.3 and 0.6 [33]. To assess the discriminative validity of the PBMs, respondents were categorised into fatigue severity groups using the summed score over the nine FSS items (possible range 9 to 63): “mild/no fatigue” (FSS total ≤ 35), “moderate fatigue” (36 ≤ FSS total ≤ 52) and “severe fatigue” (FSS total ≥ 53). The definition of “mild/no fatigue” was based on the published cut-off point for the FSS [17]. The ability of the PBMs to differentiate between the three groups was investigated using ANOVA and standardised effect sizes. Effect sizes can be assessed as small (0.20–0.49), moderate (0.50–0.79) or large (0.80 or over) [34].
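The severity grouping and the effect size comparison can be illustrated with the sketch below. Two assumptions are made: the cut-offs quoted above exceed 7, so they are read here as applying to the summed FSS score; and the standardised effect size is computed as the difference in group means divided by the pooled standard deviation (Cohen's d), which the text does not specify. Function names and example values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def fatigue_group(fss_total):
    """Classify a summed FSS score (range 9 to 63) using the stated cut-offs."""
    if fss_total <= 35:
        return "mild/no fatigue"
    if fss_total <= 52:
        return "moderate fatigue"
    return "severe fatigue"


def standardised_effect_size(group_a, group_b):
    """Difference in means divided by the pooled standard deviation.

    This pooled-SD form is an assumption; other definitions exist.
    """
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * stdev(group_a) ** 2 + (n_b - 1) * stdev(group_b) ** 2
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical HSUVs for two severity groups
d = standardised_effect_size([0.8, 0.9, 1.0], [0.5, 0.6, 0.7])
```

An effect size computed this way can then be read against the small, moderate and large bands quoted in the text.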

Development of mapping algorithms

Exploration of model specifications

The relationships between the source and target measures were examined using statistical conventions reported in the mapping literature [29, 35]. The distribution of scores on each of the measures was explored using histograms, and the relationship between each of the PBMs and the FSS total score was investigated using scatterplots. Five regression models were estimated for each PBM. HSUVs were regressed on the:

  • Total FSS score (Model A);

  • Total FSS score and total FSS score squared (Model B);

  • Total FSS score, age and gender (Model C);

  • FSS item scores (Model D);

  • FSS item scores, age and gender (Model E).
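The five specifications listed above differ only in which explanatory variables enter the regression. The sketch below builds the regressor list for one respondent under each specification. It is illustrative only: the function name is hypothetical, the FSS total is assumed to be the summed item score, and gender is assumed to be coded 0 for male and 1 for female.

```python
def regressors(model, fss_items, age=None, female=None):
    """Return the explanatory variables for Models A to E.

    fss_items: the nine FSS item responses for one respondent.
    female: assumed coding of gender, 0 = male and 1 = female.
    """
    total = sum(fss_items)
    if model == "A":
        return [total]
    if model == "B":
        return [total, total ** 2]
    if model == "C":
        return [total, age, female]
    if model == "D":
        return list(fss_items)
    if model == "E":
        return list(fss_items) + [age, female]
    raise ValueError(f"unknown model: {model}")


# Hypothetical respondent: nine item responses, aged 50, female
items = [4, 5, 3, 6, 4, 5, 4, 7, 5]
row_b = regressors("B", items)
row_e = regressors("E", items, age=50, female=1)
```

Each specification also includes an intercept, which regression software normally adds automatically.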

The majority of mapping studies estimate algorithms using ordinary least squares (OLS) models [35]. However, OLS models can predict values outside the possible range for a PBM, and can lack predictive accuracy for extreme HSUVs. To address this, Tobit models were also considered, specifying an upper limit of 1 [29]. OLS and Tobit models rely on an assumption of no heteroscedasticity. Where this assumption was violated according to White’s test for heteroscedasticity, the ‘vce(robust)’ option was used in conjunction with the ‘regress’ command for the OLS analyses, and Censored Least Absolute Deviations (CLAD) estimation methods [36] were used instead of Tobit models, employing the ‘clad’ command with a specified upper limit of 1.
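To illustrate what a heteroscedasticity-robust OLS fit involves, the sketch below fits a one-predictor model (as in Model A) and computes a sandwich standard error for the slope with an n/(n − k) small-sample scaling. It is a plain-Python illustration, not the Stata routine used in the study, and it does not cover CLAD estimation. The function name and the data are hypothetical.

```python
from math import sqrt


def ols_robust(x, y):
    """Fit y = a + b*x by OLS; return (a, b, robust standard error of b)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    # Sandwich variance of the slope: squared residuals weight each
    # observation's contribution, so equal error variance is not assumed.
    meat = sum((xi - x_bar) ** 2 * ei ** 2 for xi, ei in zip(x, residuals))
    # Scale by n / (n - k), with k = 2 estimated parameters.
    var_b = (n / (n - 2)) * meat / sxx ** 2
    return a, b, sqrt(var_b)


# Hypothetical data: summed FSS scores and observed HSUVs
intercept, slope, slope_se = ols_robust([10, 20, 30, 40], [0.9, 0.8, 0.7, 0.6])
```

The coefficient estimates are the same as in ordinary OLS; only the standard errors change.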

Predictive ability was assessed using the following estimation errors: mean absolute error (MAE), root mean squared error (RMSE) and the proportions of estimates that fell within 0.05, 0.10 and 0.25 of the observed HSUV. MAE was selected as the primary criterion for selection of the preferred models [11]. However, if coefficients had unexpected signs these models were not selected. In instances where model MAEs were the same, the model with the best profile of estimates falling within 0.05, 0.10 and 0.25 of the observed HSUV was selected.
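The estimation errors named above can be computed as in the following sketch. The function name and the example values are hypothetical, and treating "within" as less than or equal to the threshold is an assumption.

```python
from math import sqrt


def prediction_errors(observed, predicted, thresholds=(0.05, 0.10, 0.25)):
    """Return MAE, RMSE and the share of absolute errors within each threshold."""
    abs_errors = [abs(o - p) for o, p in zip(observed, predicted)]
    n = len(abs_errors)
    summary = {
        "MAE": sum(abs_errors) / n,
        "RMSE": sqrt(sum(e ** 2 for e in abs_errors) / n),
    }
    for t in thresholds:
        # Proportion of predictions whose absolute error is at most t
        summary[f"within_{t:.2f}"] = sum(e <= t for e in abs_errors) / n
    return summary


# Hypothetical observed and predicted HSUVs
errors = prediction_errors([0.80, 0.60, 0.40, 0.90], [0.78, 0.68, 0.70, 0.90])
```

MAE weights all errors equally, whereas RMSE penalises large errors more heavily, so the two can rank models differently.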

Two researchers decided independently which models to take forward for validation. Where discrepancies arose, these were resolved through discussion until consensus was reached. Demographic variables may not be included in the datasets from which HSUVs are to be estimated. Therefore, where the best performing models included demographic variables, the best performing model without demographic variables was also selected.

Validation and model selection

Estimation errors were assessed according to the severity of the health state. The selected models were applied to the validation dataset and their performance was assessed using the criteria outlined above.

Results

Preliminary assessment of measures

The conceptual framework that was developed to assess conceptual overlap between the measures is illustrated in Fig. 1. Most of the themes that had been identified in the original qualitative research studies fitted into the three domains of HRQoL that were defined a priori. There were two notable exceptions. Several of the themes described the experience of fatigue itself, rather than its effect on HRQoL. This experience was clearly of great importance to the people with MS who contributed to the original research, and underpinned the ways in which fatigue impacts upon HRQoL. Therefore, an additional domain was added: “Descriptions of fatigue”. In terms of the links between themes, a clear relationship emerged between “functioning and participation” and “psychological well-being”. People with MS specifically identified negative effects on their psychological well-being that were caused by the impact of their fatigue on their functioning and participation. These stood alongside, but distinct from, the direct impact of fatigue on psychological well-being. Therefore, this became a domain in its own right.

Fig. 1 Conceptual framework

In terms of conceptual overlap, the FSS and all PBMs cover the three primary domains of the conceptual framework (Physical, Mental and Participation Effects) (Table 2). Coverage of Participation Effects is strong across all four measures. The FSS, SF-6D and MSIS-8D capture a wide range of Physical Effects, whereas the EQ-5D includes only specific dimensions for pain/discomfort and mobility. In terms of Mental Effects, the FSS includes one item relating to motivation, while the PBMs describe other specific symptoms eg depression or anxiety. Only the MSIS-8D includes cognitive effects. The MSIS-8D and SF-6D include dimensions relating specifically to fatigue or vitality.

Table 2 Comparison of measures against conceptual framework

Significant (p < 0.0001) moderate correlations were evident between the FSS total score and HSUVs derived from the EQ-5D (r = −0.455) and the MSIS-8D (r = −0.590). There was a large significant correlation (p < 0.0001) between the FSS total score and HSUVs derived from the SF-6D (r = −0.647). The FSS total score was significantly correlated with all individual dimensions of the PBMs, and HSUVs derived from each of the PBMs were significantly correlated with all individual items of the FSS (p < 0.0001). Most correlations were moderate, as anticipated, and all had the expected negative sign, ie higher FSS scores are related to lower HSUVs (Table 3).

Table 3 Correlations between Fatigue Severity Scale and preference-based measures

28.4% of respondents with a valid FSS total score were in the “mild/ no fatigue” category, 36.6% were in the “moderate fatigue” category and 35.0% were in the “severe fatigue” category. All PBMs discriminated significantly between fatigue severity groups (p < 0.0001). The SF-6D performed particularly well, with large standardised effect sizes (≥0.80). Overall, standardised effect sizes were higher for the MSIS-8D than for the EQ-5D (Table 4).

Table 4 Discriminative validity

As a result of the preliminary assessments, it was judged that conceptual overlap and discriminative validity were sufficient to proceed with the estimation of mapping models. Overall, the SF-6D and MSIS-8D provide a better fit with the FSS.

Results of mapping analysis

Exploration of model specifications

In order to allow for heteroscedasticity, skewness and kurtosis identified in the data, we fitted robust OLS models and used a CLAD rather than a Tobit specification. (The distributions of scores on each of the measures, and the relationships between scores on the PBMs and the FSS total score, are shown in Additional file 2: B and Additional file 3: C.) Thirty models were considered, with Models A to E estimated for each PBM, using both OLS and CLAD specifications.

There was little difference between the predictive ability of the models based on FSS total scores and individual FSS items. In all models, item FSS-08 had a significant coefficient with an unexpected sign, and a majority of the FSS items (ranging from five to seven of the nine items) were not significant predictors of HSUVs. Furthermore, data on individual FSS items may not be available in all potential applications of the mapping algorithms. Therefore selection was restricted to algorithms based on the FSS total score.

EQ-5D

CLAD C had the lowest MAE and the highest proportion of individuals with small prediction errors. We also selected CLAD A, as the model with the lowest MAE among those that did not include demographic variables.

SF-6D

OLS B and CLAD B had coefficients with unexpected signs and were, therefore, not selected. We selected CLAD C as it had the next lowest MAE, and OLS A and CLAD A, as they did not include demographic variables.

MSIS-8D

CLAD B and OLS B had the lowest MAEs; however, these had unexpected signs for FSS total, and so were not selected. The model with the next lowest MAE and highest proportion of individuals with small prediction errors was CLAD C. As this model included demographic variables, we also selected the model with the next lowest MAE (0.117), CLAD A.

Details of the selected models are presented in Table 5. All model results are provided in Additional file 4: D.

Table 5 Models mapping from FSS total to PBMs using estimation dataset

Validation and model selection

The validation dataset was used to assess estimation errors for the selected models (Table 6). Table 7 shows MAEs for ‘poor’ and ‘good’ health states by model. The models predicting HSUVs for the EQ-5D and MSIS-8D had larger MAEs for poorer health states, indicating that these models performed less well at estimating scores for those in poorer health states. The opposite was true for the SF-6D models, although the difference in MAEs here was less marked. (Please see Additional file 5: E and Additional file 6: F).

Table 6 Models mapping from FSS total to PBMs using validation dataset
Table 7 Mean absolute errors by severity group

Discussion

Here we describe and demonstrate a method for converting responses to the FSS, a frequently-used measure of fatigue severity, into HSUVs, which can be used to estimate QALYs for use in cost-effectiveness analyses, and hence to inform decision-making regarding the availability of treatments for MS-related fatigue. According to the Oxford Health Economics Research Centre’s Mapping Database, last updated in April 2019 [37], no previous published studies have attempted mapping from the FSS. In addition, we have found no previous studies which have investigated correlations between the FSS and the SF-6D or the FSS and the MSIS-8D, and just two which have explored the relationship between the FSS and the EQ-5D [38, 39]. Rosa et al. [39] correlated FSS total scores with participants’ scores on the EQ-5D visual analogue scale, rather than with the EQ-5D HSUVs that are relevant for mapping, and Tremmas et al. [38] found no statistically significant correlation between the FSS and EQ-5D scores of people with lung cancer.

The ability of the models selected in the current study to predict SF-6D and MSIS-8D values is in keeping with results reported in other mapping studies [35]. There are currently no guidelines regarding acceptable limits for estimation errors [13], but MAEs ranging from 0.0011 to 0.19 have been previously described [35]. In the current study, the SF-6D MAEs of 0.078 and 0.077 and the MSIS-8D MAEs of 0.117 and 0.116 fall well within this range and, specifically in the context of MS, they are in keeping with the MAE of 0.058 reported by Hawton et al. [12] when the MSIS-29 was mapped to the SF-6D.

Results for the EQ-5D algorithms were less convincing. The prediction errors of 0.175 and 0.173 are towards the higher end of MAEs reported in previous mapping studies [35], and are also high in the context of MS mapping studies. Versteegh et al. [13] mapped from version 1 of the MSIS-29 to the EQ-5D, with resulting MAEs of 0.13 and 0.16, and Hawton and colleagues [12] mapped from version 2 of the same measure to the EQ-5D with a MAE of 0.147. In addition, when testing the external validity of the Versteegh et al. [13] algorithm, Ernstsson et al. [40] reported a MAE of 0.12.

Information is inevitably lost in the process of mapping, as the resulting algorithm will only reflect the areas of content that overlap between the starting and target measures. This information loss is accentuated when a domain-specific, condition-specific measure, such as the FSS, is mapped to a generic, multi-dimensional measure, such as the EQ-5D. Therefore, greater prediction errors might be anticipated when mapping from a uni-dimensional scale such as the FSS than when mapping from a multi-dimensional scale such as the MSIS-29 [41]. However, this does not appear to hold in the MS mapping literature to date: Hawton et al. [14] reported a MAE of 0.148 when they mapped from the MS Walking Scale-12 (a mobility-specific, MS-specific measure) to the EQ-5D, and Sidovar et al. [42] described an error statistic of 0.109 when mapping between these same measures.

In the current study, the EQ-5D algorithms were particularly problematic for HSUVs below 0.65. They did not predict any values below 0.54 (assuming an age of 50 years and female gender for CLAD Model C), which is of particular concern for a measure with a minimum value of −0.594.

On the basis of the statistical assessments reported here, the qualitative assessments of conceptual validity, and setting our findings in the context of other mapping studies in MS and mapping studies more generally, we suggest the use of the following algorithms for mapping from the FSS to HSUVs.

SF-6D estimate = 0.897 − 0.006 × FSS total score

MSIS-8D estimate = 1.084 − 0.008 × FSS total score − 0.001 × age − 0.024 × gender [0 male, 1 female]

or, if age and gender are not available:

MSIS-8D estimate = 0.985 − 0.007 × FSS total score
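For readers who wish to apply the suggested algorithms, a minimal sketch follows. The coefficients are those given above; the function names are hypothetical, and the FSS total is assumed to be the summed score over the nine items (range 9 to 63), as implied by the severity cut-offs used in this paper. Age is assumed to be in years.

```python
def sf6d_from_fss(fss_total):
    """Estimate an SF-6D HSUV from the FSS total score."""
    return 0.897 - 0.006 * fss_total


def msis8d_from_fss(fss_total, age=None, female=None):
    """Estimate an MSIS-8D HSUV from the FSS total score.

    Uses the algorithm with age and gender (0 = male, 1 = female) when
    both are supplied, and the FSS-only algorithm otherwise.
    """
    if age is None or female is None:
        return 0.985 - 0.007 * fss_total
    return 1.084 - 0.008 * fss_total - 0.001 * age - 0.024 * female


# Hypothetical respondent: FSS total of 36, aged 50, female
sf6d = sf6d_from_fss(36)
msis8d_simple = msis8d_from_fss(36)
msis8d_full = msis8d_from_fss(36, age=50, female=1)
```

Estimates produced in this way carry the prediction error reported for each model and are intended for group-level analysis rather than as a substitute for directly collected HSUVs.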

Based on these same assessments, we suggest the EQ-5D algorithms are far less likely to produce accurate or valid estimates of EQ-5D scores.

There are a number of potential limitations of this work. Firstly, the SWIMS data were collected prior to the development and use of the EQ-5D-5L and the mapping algorithms were based on the ‘older’ EQ-5D-3L. It may have been expected that the EQ-5D-5L would supersede the EQ-5D-3L as it was developed with five, rather than the original three, levels in an attempt to improve its responsiveness. However, the English HSUV set for the EQ-5D-5L is not in common use, and if using the EQ-5D-5L descriptive system, the current ‘position statement’ of NICE is to use a cross-walk algorithm to provide HSUVs from the EQ-5D-3L value set. Secondly, the SF-6D value set is based on the use of standard gamble to elicit preferences for health states. This may result in higher HSUVs (than the EQ-5D), as respondents tend to be risk averse. Thirdly, we did not explore the performance of some of the ‘newer’ mapping model specifications, such as limited dependent variable mixture models or beta-based regression, which may have better accounted for the bi-modal nature of the EQ-5D data. There is some empirical evidence in support of these models, but the ISPOR Task Force report [16] does not advocate any specific regression approach for mapping, recognising that the performance of different methods will vary dependent on a number of factors including the nature of the starting/target measures, the disease, and the patient population. The report suggests it is wise to use a model type for which there is existing evidence of good performance. In the context of MS, mapping algorithms which have used the same regression approaches that we have used here have been reported with MAEs of 0.058 [12], 0.13 and 0.16 [13], 0.147 [12], 0.12 [40], 0.148 [14] and 0.109 [42]. Brazier et al.’s [35] systematic review of mapping studies reported MAEs of 0.0011 to 0.19. Therefore, the regression approaches in the current paper have a track record of use and acceptability in the context of MS.
The MAEs reported here for the SF-6D and MSIS-8D are in keeping with those reported in these other mapping studies. The poor performance of the EQ-5D algorithms is likely to be a function of the limited conceptual overlap between the EQ-5D and the FSS. The limited shared conceptual content of these measures will not be altered by using a different form of regression analysis. Fourthly, algorithms to predict HSUVs from individual FSS items, rather than the total score, were not generated by this study. This was, in part, due to an anomaly affecting item FSS-08 (Fatigue is among the most disabling of my symptoms). While the item correlated negatively (as expected) with HSUVs when considered in isolation, it had a positive coefficient when included as an independent variable in regression analysis. Further research would be required to understand the mechanisms behind this; in the meantime, it is not possible to determine whether this item is suitable for inclusion in a mapping algorithm.

A particular strength of this study is the nature of the SWIMS dataset. It has provided comprehensive data on which to base the estimation and validation of these mapping algorithms. Importantly, the cohort is comparable with other UK-based samples of people with MS in terms of age, gender, relapse rates and duration of illness [8, 43,44,45,46,47], meaning the algorithms should apply generally to people with MS, rather than just to specific sub-groups. In addition, the work undertaken to explore the content overlap between the measures provided a form of ‘triangulation’ in assessing the appropriateness of the mapping algorithms. Drawing on good quality qualitative research findings regarding the impacts of fatigue on HRQoL and developing a conceptual framework provided unique insights into why the measures did and did not map well.

It is acknowledged that mapping methods are a second-best option to directly collected HSUVs for estimating QALYs [29, 41, 48]. Use of mapping increases the uncertainty and error around estimates of HSUVs [29], and is particularly problematic when there is little content overlap or relationship between the measures being mapped to and from [41]. However, when PBM data are not collected directly in a trial, empirically-evidenced mapping algorithms may be used. With the exception of the EQ-5D, the algorithms reported here can be used to support improvements in decision-making where primary PBM data are unavailable.

Conclusions

We present statistical algorithms that allow data from the FSS, a fatigue-specific patient-reported outcome measure, to be used in the estimation of QALYs, which are a suitable and policy-relevant measure for use in cost-effectiveness analyses. This will enable the results of studies using the FSS to inform decision-making in a health technology assessment context.