Introduction

Multiple sclerosis (MS) is a chronic disease resulting from inflammation and demyelination in the central nervous system (CNS) [1] that is associated with a variety of symptoms, such as fatigue, impaired mobility and cognitive decline [2]. Several new therapies, behavioural [3–9], medical [10–14], and surgical [15–19], have been developed in the field of MS. As there are both benefits and harms from interventions, the importance of considering the patient’s perspective in the evaluation of these new therapies is increasingly being emphasized. Patient-reported outcomes are used to evaluate the patient’s perspective on the impact of the disease and its treatment on symptoms, function, and other aspects of quality of life (QOL). QOL is defined as an “individuals’ perception of their position in life in the context of the culture in which they live and in relation to their goals, expectations, standards and concerns [20].” QOL is a global construct that includes domains other than health such as job satisfaction, quality of housing, and the neighborhood in which one lives [21]. Health-related quality of life (HRQL), on the other hand, is a construct that is narrower and focuses on domains within the purview of the health care system, such as normal ranges for physiological variables, physical, mental and social well-being [22, 23]. Health status, a term often confused with HRQL, is a description and/or measurement of the health of an individual or population at a particular point in time against identifiable standards [24].

While there is a common set of domains that are relevant across a wide variety of health conditions, and to people with no health condition, these domains may be affected differentially because of the positive and negative effects of interventions. For example, a treatment may have a positive effect on one domain (e.g. mental health) but a negative one on another (e.g. physical health), and this would be condition- and intervention-specific.

The most widely used methodology to create an index that weighs gains in one domain against losses in another is based on utility theory. Utility measures (or preference-based measures) provide a single value for the construct (health status, HRQL, or QOL) ranging from 0 (for death or worst possible health state) to 1 (for perfect health or best possible health state) [25–29]. This value is used to calculate what is termed a “Quality-Adjusted Life Year” (QALY), which captures the effect of an intervention on quantity of life (mortality) and “quality of life” (which is conceptualized as morbidity) [30–33]. The “Q” in QALY is a misnomer, given that it measures only the health aspects of QOL; the other aspects, which have been elegantly identified by Flanagan, are physical and material well-being, relations with other people, social community and civic activities, personal development and fulfillment, and recreation [34].
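As an illustration of how a utility value enters a QALY calculation, the following minimal sketch sums utility-weighted time over a sequence of periods. The function name `qalys`, the absence of discounting, and all numbers are assumptions made for illustration; they are not data or methods from this study.

```python
def qalys(periods):
    """Sum utility-weighted time.

    periods: iterable of (utility, years) pairs, where utility is the value
    assigned to the health state and years is the time spent in that state.
    No discounting is applied (a simplifying assumption).
    """
    total = 0.0
    for utility, years in periods:
        if years < 0:
            raise ValueError("years must be non-negative")
        total += utility * years
    return total


# Hypothetical example: two years in one state followed by three in another.
with_treatment = qalys([(0.7, 2), (0.6, 3)])
without_treatment = qalys([(0.6, 2), (0.5, 3)])
gain = with_treatment - without_treatment

print(f"QALYs with treatment:    {with_treatment:.2f}")
print(f"QALYs without treatment: {without_treatment:.2f}")
print(f"QALY gain:               {gain:.2f}")
```

In a cost-utility analysis, a gain of this kind would be set against the difference in cost between the two options.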

The three most widely used utility measures, namely the Health Utilities Index Mark 2 and 3 (HUI2 and HUI3), the EuroQol-5D (EQ-5D) and the Short-Form-6D (SF-6D), label the constructs underlying these measures as health status and/or HRQL [35–39]. None list QOL as the construct being measured. Yet, for economic evaluation, the QALY is the parameter calculated and compared with cost.

In line with guidelines for economic evaluation from agencies such as the National Institute for Health and Clinical Excellence (NICE) and the Canadian Agency for Drugs and Technologies in Health (CADTH), these measures are currently being used to evaluate the cost-effectiveness of different interventions in MS. However, the challenge of using such measures in people with a specific health condition, such as MS, is that they may not capture all of the domains that are affected by the health condition. If important domains are missing from the generic measures, the value derived will be higher than the real impact of the condition warrants, creating invalid comparisons across interventions and populations.

Personalized measures have been proposed as a method for identifying those aspects of a health condition that affect QOL. While these aspects may differ from person to person and across health conditions, the value derived from them represents QOL. The most commonly used individualized measures of QOL are the Patient Generated Index (PGI) and the Schedule for the Evaluation of Individual Quality of Life-Direct Weighting (SEIQOL-DW). Both measures capture the individual’s perspective on QOL by permitting him/her to nominate the areas of life that are most important and assign a weight to each domain. Personalized measures of QOL have been used in several clinical trials to evaluate the effectiveness of different interventions on overall QOL [40–44]. Furthermore, these measures have been shown to be particularly useful in clinical settings by improving patient-physician communication and by helping prioritize treatment options [45–47].

The global aim of the study was to contribute evidence for the content validity of generic utility measures with respect to capturing the relevant domains for people with MS. The specific objective was to estimate the extent to which generic utility measures capture important domains that are affected by MS.

Methods

Subjects

The data for this study come from a study of the life-impact of MS among people diagnosed during the era of magnetic resonance imaging (MRI) and disease modifying therapies (the New MS) [48]. The available study population consisted of both men and women who had been registered after 1994 at the three participating MS clinics in Greater Montreal, Quebec, Canada. The study was approved by all regional ethics committees. Inclusion criteria for the study were diagnosis of MS or Clinically Isolated Syndrome (CIS) after 1994. From a pool of 5000 patients, a centre-stratified random sample of 550 patients was drawn, of whom 394 were contacted. Of those who were contacted, the first 192 persons who responded were enrolled, 189 completed all questionnaires and 185 came for an interview. Respondents and non-respondents were compared and no clinically or statistically significant differences were found between the two groups on socio-demographic characteristics.

Measurement

Patient generated index

The PGI is an individualized measure of HRQL that was administered in three stages. In the first stage, patients were asked to identify up to five of the most important areas of their lives affected by MS. In the second stage, patients were asked to rate how badly affected they were in each of the selected areas on a scale of 0 to 10, where 0 was the worst they could imagine and 10 exactly as they would like to be. A sixth box was provided to rate all other health or non-health related areas. In the third stage, they were given twelve spending “points” or “tokens” to distribute among the areas identified. The tokens that they allocated to each area represented the relative importance of potential improvements in the chosen area. The more tokens a patient spent on an area, the more important that area was; the fewer tokens a patient spent, the less important that area was. The rating for each area was multiplied by the proportion of “points” for that area, and these products were then summed to produce an index from 0 to 100 [49]. For ease of comparison with the utility measures, PGI scores in this study were presented on a scale from 0 to 1.
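The scoring rule described above can be sketched as follows. This is a minimal illustration, not the authors' scoring code: the function name `pgi_score` and the example ratings and points are hypothetical, and the rescaling to 0 to 1 is assumed to be a simple division of the weighted mean rating (0 to 10) by 10.

```python
def pgi_score(ratings, points, total_points=12):
    """Return a PGI index on a 0-1 scale.

    ratings: one 0-10 rating per area (0 = worst imaginable,
             10 = exactly as the person would like to be)
    points:  spending points allocated to each area; must sum to total_points
    """
    if len(ratings) != len(points):
        raise ValueError("ratings and points must have the same length")
    if any(r < 0 or r > 10 for r in ratings):
        raise ValueError("ratings must lie between 0 and 10")
    if any(p < 0 for p in points) or sum(points) != total_points:
        raise ValueError(f"points must be non-negative and sum to {total_points}")
    # Weight each rating by its share of the points; the result is a
    # weighted mean rating on the 0-10 scale.
    weighted_mean = sum(r * p for r, p in zip(ratings, points)) / total_points
    # Assumed rescaling: 0-10 -> 0-1 (equivalently, the 0-100 index / 100).
    return weighted_mean / 10


# Hypothetical respondent: five nominated areas plus the sixth "all other" box.
ratings = [3, 5, 4, 6, 2, 7]
points = [4, 3, 2, 1, 0, 2]
print(f"PGI = {pgi_score(ratings, points):.2f}")
```

Because the weights are proportions of a fixed budget of points, an area that receives no points contributes nothing to the index, however poorly it is rated.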

EQ-5D

The EQ-5D is a generic preference-based measure of HRQL that consists of two parts [50, 51]. The first part includes 5 separate domains: mobility, capacity for self-care, conduct of usual activities, pain/discomfort and anxiety/depression. Each domain has 3 levels: no problems, some problems, extreme problems. The second part consists of a Visual Analogue Scale (EQVAS) to measure self-perceived health on a vertical scale from 0 to 100, where 0 is the worst imaginable health state and 100 is the best imaginable health state. The EQ-5D defines 243 health states, and has a range from −0.6 to 1.0.

SF-6D

The SF-6D is a generic preference-based measure derived from the SF-36 Health Survey (or RAND-36) [23, 39]. The SF-6D has 6 domains: physical functioning, role limitation, social functioning, pain, mental health and vitality. Each domain has between 4 and 6 levels. The index defines 18 000 health states, and has a range from 0.3 to 1.0.
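The number of health states each measure defines follows from multiplying the number of levels across its domains. The short check below illustrates this. The EQ-5D levels are as stated in the text; the per-domain level counts for the SF-6D are an assumption chosen to be consistent with the stated range of 4 to 6 levels per domain and should be checked against the SF-6D documentation.

```python
from math import prod

# EQ-5D: five domains with three levels each (as described in the text).
eq5d_levels = {
    "mobility": 3,
    "self-care": 3,
    "usual activities": 3,
    "pain/discomfort": 3,
    "anxiety/depression": 3,
}

# SF-6D: per-domain level counts are an assumption (between 4 and 6 each).
sf6d_levels = {
    "physical functioning": 6,
    "role limitation": 4,
    "social functioning": 5,
    "pain": 6,
    "mental health": 5,
    "vitality": 5,
}

eq5d_states = prod(eq5d_levels.values())
sf6d_states = prod(sf6d_levels.values())

print("EQ-5D health states:", eq5d_states)
print("SF-6D health states:", sf6d_states)
```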

Procedure

Figure 1 presents a flowchart of the study procedure.

Figure 1

Flowchart of the study procedure.

Subjects were first interviewed on an individualized measure of QOL, the PGI [49]. The domains identified with the PGI were then classified and grouped together independently by four raters using the World Health Organization’s International Classification of Functioning, Disability and Health (ICF) [52]. This methodology closely followed that used by Mayo et al. [53], who evaluated the extent to which HRQL measures captured constructs beyond symptoms and function. The ICF provided a coding framework and standardized description of health related problems at the level of body structure/function (e.g. fatigue, cognition), activity (e.g. dressing, feeding, walking) and participation (e.g. school, work). Problems at these levels are also known as impairments, activity limitations and participation restrictions, respectively. Any discrepancies between raters were resolved by discussion.

Last, the domains were mapped onto the HUI2, HUI3, EQ-5D and SF-6D which had been previously mapped to the ICF [53]. The extent to which these utility measures captured domains important to patients with MS was qualitatively appraised.

Data analysis

We had data on hand for the PGI, the EQ-5D and the SF-6D (derived from the RAND-36). As all three measures were administered to the same individuals, generalized estimating equations (GEE) were used to adjust the variance for the clustering of outcomes within persons. The advantage of using GEE, as opposed to the paired t-test, was that it allowed for simultaneous assessment of, and correlation among, all 3 measures. The regression coefficients produced by the model were estimates of the difference between measures (with 95% CI) adjusted for the correlation among data points. An effect size (ES) was then calculated using the t-statistic, which was equal to the adjusted regression coefficient divided by its standard error (SE).

Results

A total of 185 persons with MS were interviewed on the PGI. The sample was relatively young (mean age 43) and predominantly female. Both men and women had mild disability, with a median Expanded Disability Status Scale (EDSS) score of 2. The average time since diagnosis was 6 years, and 59% of the sample was on Disease Modifying Therapies. Demographic and clinical characteristics are presented in Table 1.

Table 1 Demographic and clinical characteristics of sample (n = 185)

Table 2 presents the top 10 domains that patients identified to be the most affected by their MS. These areas were work (62%), fatigue (48%), sports (39%), social life (28%), relationships (23%), walking/mobility (22%), cognition (21%), balance (14%), housework (12%) and mood (11%). The mean impact score for each domain (from 0 to 10) ranged from 3.9 to 5.0. In terms of the mean number of points spent on each domain, patients spent the most points (4.3) to improve their relationships, followed by fatigue (3.8) and then walking (3.6).

Table 2 Top 10 domains identified by subjects using the Patient Generated Index

Table 3 presents the results for the mapping of the 10 domains identified by MS patients against the HUI2, HUI3, EQ-5D and the SF-6D. School/work was found in the EQ-5D and SF-6D but not in the HUI2 or HUI3. Fatigue was found in the SF-6D but not in the EQ-5D or the HUI measures. Sports, which was the third most frequently reported domain, was found only in the SF-6D and HUI2. Social life was included in the EQ-5D and the SF-6D, but not in the HUI measures. Cognition was found in the HUI measures, but not in the EQ-5D or the SF-6D. Housework was included in the EQ-5D and the SF-6D, but not in the HUI2 or HUI3. Relationships and balance were not included in any of the utility measures. Mood was the only domain that was included in all of the measures.

Table 3 The domains identified by MS subjects compared with items in generic utility measures

The SF-6D included the greatest number of domains (6 domains) important to people with MS, followed by the EQ-5D (4 domains) and the HUI2 (4 domains), and then the HUI3 (3 domains).

The generic utility measures included domains that were not identified to be important by the sample, such as pain, self-care, vision, hearing, manual dexterity, speech and fertility.

The correlation between the SF-6D and the EQ-5D was 0.58. As demonstrated in Figure 2a, although the relationship between the measures was somewhat linear, discrepancies in scores between the two measures were evident. At the upper end of the scales, a number of individuals who had utility scores of 0.85 on the EQ-5D had scores as low as 0.6 on the SF-6D. A clinically meaningful difference on utility measures is 0.03, indicating that the difference in scores between the two utility measures was important. Discrepancies were also observed at the lower end of the scale, where an individual with a score of 0.12 on the EQ-5D had a score of 0.55 on the SF-6D.

Figure 2

Relationship between the EQ-5D, the SF-6D and the Patient Generated Index. a: Scatter plot of the relationship between the EQ-5D and the SF-6D. b: Scatter plot of the relationship between the Patient Generated Index and the EQ-5D. c: Scatter plot of the relationship between the Patient Generated Index and the SF-6D.

The correlation between the PGI and the EQ-5D was 0.53. As presented in Figure 2b, there were important discrepancies in scores between the two measures: several individuals with very low scores on the PGI (as low as 0.1) had very high scores on the EQ-5D (as high as 0.8). Pearson’s correlation between the PGI and the SF-6D was 0.53. Similar to what was observed for the EQ-5D, there were discrepancies in scores between the 2 measures, particularly towards the lower end of the scales (Figure 2c).

The impact of a mismatch between domains provided in the generic utility measures and those that are important to people with MS is illustrated by the total scores of the measures. As seen in Figure 3, the mean and standard deviation (SD) for the PGI, EQ-5D and the SF-6D were 0.50 (SD 0.25), 0.69 (SD 0.18) and 0.69 (SD 0.13), respectively. The magnitude of difference between the PGI and the 2 utility measures was 0.19 (95% CI 0.16 to 0.22) with ES equal to 12.

Figure 3

Mean and standard deviation values for the PGI, EQ-5D and SF-6D, with differences and 95% CI calculated using generalized estimating equations.
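As a rough consistency check on the reported effect size, the standard error can be back-calculated from the 95% CI and the difference divided by it. This sketch assumes a symmetric normal-approximation (Wald) interval with a critical value of 1.96; it is not the GEE model used in the study, and the helper name `se_from_ci` is hypothetical.

```python
def se_from_ci(lower, upper, z=1.96):
    """Approximate the standard error from a symmetric 95% confidence interval.

    Assumes the interval was built as estimate +/- z * SE.
    """
    if upper < lower:
        raise ValueError("upper bound must not be below lower bound")
    return (upper - lower) / (2 * z)


# Values reported in the text for the difference between the PGI
# and the two utility measures.
difference = 0.19
ci_lower, ci_upper = 0.16, 0.22

se = se_from_ci(ci_lower, ci_upper)
effect_size = difference / se  # coefficient divided by its SE

print(f"Approximate SE: {se:.4f}")
print(f"Approximate ES: {effect_size:.1f}")
```

Under these assumptions the ratio comes out close to the reported ES of 12, which suggests the reported difference, interval and effect size are mutually consistent.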

This mismatch was also present at the item level. A total of 41 subjects (22% of the sample) reported walking to be an important aspect of their QOL. The distribution of scores on the degree to which walking was affected for these subjects is presented in Figure 4. The impact was measured on a scale from 0 to 10 on the PGI, where 0 was the worst they could imagine and 10 was exactly as they would like to be. These scores were compared with the responses on the EQ-5D mobility item. Twelve of the 41 subjects reported having no problems with walking on the EQ-5D. These people were expected to have a score of 10 on the PGI, yet only 1 person reported a score of 10. All other subjects reported lower scores, as low as 3 (poor).

Figure 4

Frequency and distribution of PGI scores on the degree to which walking was affected from 0 (worst they can imagine) to 10 (exactly as they would like to be).

Discussion

In this study, subjects with MS were interviewed on an individualized measure to evaluate the impact of the disease on their QOL. The results of the interview generated a list of domains that were most important to the QOL of persons with MS. The domains identified were work, fatigue, sports, social life, relationships, walking, cognition, balance, housework and mood. These were then mapped onto generic utility measures to estimate the extent to which they captured domains that were important to persons with MS.

No single generic utility measure captured all of the domains important to persons with MS. For example, fatigue, which affects 75 to 90% of patients with MS [54–57], was not included in the EQ-5D or the HUI measures. Walking, another commonly reported symptom, was not found in the SF-6D. Cognition was not found in the EQ-5D or the SF-6D. Work, sports, and social life were not found in the HUI2 or HUI3. This was not surprising, as the HUI measures were developed with the intention of evaluating ‘within-the-skin’ experiences that excluded social interaction [58–60]. Balance and relationships were not included in any of the utility measures.

The generic utility measures were clearly missing domains that were important to people with MS. Out of the 10 domains that persons with MS identified as being central to their QOL, only 3 of them were included in the HUI2, 4 were included in the HUI3, 4 were included in the EQ-5D and 6 were included in the SF-6D. Furthermore, the generic utility measures included several domains that were not important to the persons sampled in this study, such as pain, self-care, hearing and manual dexterity.

To tackle the issue of lack of content validity, one emerging area of interest in the literature is the development of disease-specific “bolt-ons” or dimension extensions to generic utility measures [51]. Another emerging area of interest is the development of disease-specific utility measures, which have been developed for stroke [61], pulmonary hypertension [62], asthma [63], rhinitis [64], urinary incontinence [65] and erectile dysfunction [66]. Recently, Versteegh et al. [67] derived an MS-specific utility measure from the Multiple Sclerosis Impact Scale-29 (MSIS-29) using Rasch analysis. The authors selected 8 out of 29 items from the original questionnaire. Some important dimensions such as social life, work and mood were included, while others such as walking, sports and physical fatigue were omitted.

There are several potential benefits to using disease-specific utility measures in clinical and cost-effectiveness research. First, disease-specific utility measures are designed to include domains that are specific to a disease and are therefore likely to be more sensitive than generic measures to small changes over time. Second, not only do these measures provide descriptive information on the various dimensions of health, but they also provide a value for each one, thus allowing trade-offs to be made between the domains. Disease-specific utility measures have the potential to overcome one of the challenges associated with disease-specific health profiles: that domains cannot be combined into a single index, which makes it difficult to conclude whether an intervention was effective or not. For example, if a treatment has a positive effect on physical health but a negative one on mental health, unless we know the relative importance attached to each domain, it is impossible to determine whether the intervention resulted in a net improvement or decline in QOL/HRQL. Furthermore, disease-specific utility measures can be used to calculate QALYs and make decisions on the cost-effectiveness of different treatments in MS.

A clinician reported outcome (ClinRO) is an assessment of the status of a patient’s health condition that is made by an observer with professional training (i.e. a clinician) [69]. ClinROs are commonly used for endpoints that cannot be directly measured by the patient (e.g. EDSS to quantify level of disability in MS). An observer-reported outcome (ObsRO) is an assessment that is made by an observer without professional training (i.e. a non-clinician observer such as a teacher or caregiver) [69]. This type of evaluation is typically used when the patient is unable to self-report. A patient reported outcome (PRO) is any report of the status of a patient’s health condition that comes directly from the patient, without interpretation of the patient’s response by a clinician or other observer (e.g. symptoms, QOL, HRQL) [68, 69]. PROs play a complementary role in outcome assessment by providing evidence on the benefit or harm of a treatment from the patient’s perspective. Utility measures are one type of PRO. In outcome assessment, utility measures not only provide information on the benefits and harms of a treatment, but are also useful for economic applications because they produce QALYs. This information can provide policy and decision makers with a means of evaluating the costs and cost-effectiveness of different treatment options for a health condition.

The first step in evaluating the validity of scores produced by a PRO is an assessment of content validity, before any other forms of validity (e.g. construct validity) are assessed. Content validity of a PRO can be judged only by the individuals or populations being assessed (i.e. the patients themselves). The global aim of this study was to address this very question of whether generic utility measures captured domains that were important or relevant to people with MS. The results of this study suggest that many important domains in MS are not captured by generic utility measures, which calls into question the content validity of such measures in MS. This, in turn, casts doubt on the interpretability or meaningfulness of scores produced by these measures for this population.

It is important to target measures to people to ensure that the impact of a disease and its treatment are adequately and reliably captured in a clinical trial [70, 71]. If a PRO includes domains that are not impacted upon by the disease or its treatment, it will not be able to capture clinically meaningful change. By targeting to the disease, measures are more likely to be sensitive to small but important clinical changes. Furthermore, the ability of PROs to detect small changes is important in determining the statistical power or the necessary sample size required for a clinical trial [72].

The results of our study revealed that the 4 commonly used generic utility measures (HUI2, HUI3, EQ-5D and SF-6D) do not capture the majority of domains important to people with MS. Among these generic measures, the SF-6D captured the greatest number of domains (6 domains) that were important to people with MS. Our findings suggest that the SF-6D, compared to the other generic utility measures, may be the most appropriate one to use in MS. The PGI can be used to evaluate the clinical effectiveness of different interventions in MS. However, because the PGI was not developed using multi-attribute utility theory (and hence is not a utility measure), it cannot be used for cost-utility analysis.

Future directions that build directly on this work include the use of MS-specific “bolt-on” items or dimensions for generic utility measures [73]. This study has identified potential items important to MS, such as fatigue, that can be used as add-ons to existing generic utility measures. Another area of potential research that can build directly on this work is the development of an MS-specific utility measure that includes only dimensions pertinent to the disease.

A particular feature of this study is that we purposely sampled people with MS diagnosed in the era of Magnetic Resonance Imaging (MRI) technology and availability of disease modifying drugs [48]. As these are the people who are faced with treatment decisions, a method of valuing changes on the most important domains of QOL affected by MS would be the most relevant for this population.

Conclusions

Generic utility measures are designed to include a common set of dimensions that most people will value highly, and therefore underrepresent dimensions that may be specific to a particular disease. Although the generic utility measures included certain items that were important to people with MS, several were missing. An important consequence of this mismatch was that values of QOL derived from the PGI were significantly, and to an important degree, lower than those estimated using any of the generic utility measures. This could have a substantial impact on the evaluation of interventions in people with MS. The overestimation in scores obtained with utility measures may not matter at the start of a clinical trial, but it will matter at follow-up. If scores are high at baseline, there will likely be little room for improvement on the scale, resulting in the false conclusion that the treatment group did not change post-treatment when, in reality, the treatment may have had a positive effect that the measure was unable to detect. The difference between the treatment and control groups (assuming the control group also does not change) would then be zero. In addition, an intervention that is in fact beneficial for fatigue, for example, would risk showing no change on a generic measure that does not include this item. When choosing an outcome measure for an intervention, it is essential to choose one with items that can or should be affected by the intervention. Given that the MS-specific items do affect QOL, omitting them would result in a false estimate of QALYs and bias the evaluation of the cost-effectiveness of interventions in MS.

Authors’ information

AK is a physiotherapist and a PhD Candidate in Rehabilitation Sciences at the School of Physical and Occupational Therapy, McGill University. NM is a James McGill Professor in the Department of Medicine and the School of Physical and Occupational Therapy, McGill University.