The EuroQol Group task force recommended that English and Spanish versions be developed in parallel, so that they could also serve as root languages for further translations and adaptations of the expanded version.
The study consisted of two phases. In the first phase, carried out from June to November 2007, a pool of potential labels for the new levels was identified and provisional labels for the 5-level version were chosen from that pool after a response scaling task carried out in face-to-face interviews with convenience samples of lay respondents. In the second phase, carried out from May to July 2008, face and content validity of two alternative 5-level systems were tested in focus group sessions with healthy participants and those with chronic illness. The second phase was also used to test the face validity of a series of health states based on the 5-level versions. Different groups of respondents were used in the two phases of the study.
Participants in both phases were recruited to ensure a wide range of socio-demographic characteristics. For the response scaling phase, the UK participants were recruited via local newspaper advertisements, local community advertisements, and from an existing participant database. The Spanish participants were recruited from among parents from local schools and from patient associations. Patient focus groups included primarily individuals with arthritis, diabetes, or asthma. In all groups, adequate written and oral fluency in English or Spanish was required.
Written informed consent to participate was obtained from all participants in both phases of the study.
Phase 1: response scaling
Potential labels for the EQ-5D-5L were identified from a review of existing health-related quality-of-life instruments, a review of the literature on response scaling, hand searching of dictionaries and thesauruses, and informal interviews with native speakers of the target languages to establish how they described different severities of health problems. The same process was carried out in English and Spanish and, where possible, equivalent terms were sought in both languages. Labels included in the initial pool clearly had to fit with the lexical structure used in the EQ-5D-3L, such as ‘I have no problems doing my usual activities’ and ‘I have some problems doing my usual activities’.
In order to select labels from the pool for the new levels, an interviewer-administered response scaling exercise similar to those used in previous studies [14, 19, 20] was adopted to estimate the severity represented by each label. For this exercise, respondents were shown a rating scale in the form of a vertical, hash-marked, 40 cm visual analog scale (VAS) with end points of 0 and 100 to be used as a visual aid in grading label severity. For the Mobility, Self-Care and Usual Activities dimensions, the same set of labels was used. The interviewer placed a card labeled ‘No problems’, ‘No pain/discomfort’, or ‘No anxiety/depression’, as appropriate, at the bottom of the scale (0) to act as the lower anchor, and a card labeled ‘Unable to’, ‘The worst pain or discomfort I can imagine’, or ‘As anxious or depressed as I can imagine’ as the upper anchor (100). The respondent was then shown the other labels from the pool one at a time in a quasi-random order and asked to assign each a score between 0 and 100 to indicate its severity in relation to the lower and upper anchors.
The interviewer noted all scores, and when the respondent had rated all labels for a particular dimension, the interviewer laid them out in rank order alongside the VAS and asked the respondent to review the ranking and make any changes he or she thought necessary. If labels were reordered at this point, the respondent was asked to assign a new score to the relevant labels. Final scores assigned were recorded in an answer booklet. The scaling task was repeated for each dimension. Before finishing with the cards, the respondent was asked whether any of the labels sounded unusual, or should not be used in relation to a particular dimension.
Respondents rated labels for all five dimensions. The three functional dimensions (Mobility, Self-Care and Usual Activities) were always interspersed with the Pain/Discomfort and Anxiety/Depression dimensions, so that the respondent did not rate the same label types consecutively. Before rating the actual labels, respondents performed a practice task based on levels of overall health to get used to the study requirements. Data on age, level of education, main activity, and use of any current treatment for health problems, together with the existing EQ-5D-3L descriptive system and EQ-VAS, were collected after the response scaling task.
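The interspersing rule described above can be illustrated with a short sketch. It enumerates every presentation order in which no two functional dimensions are rated consecutively. This is only an illustration of the constraint: the function name is hypothetical, and which of these orders were actually used in the interviews is not stated here.

```python
from itertools import permutations

# Dimension names as given in the text.
FUNCTIONAL = ("Mobility", "Self-Care", "Usual Activities")
OTHER = ("Pain/Discomfort", "Anxiety/Depression")


def interleaved_orders():
    """Return all orders in which functional dimensions are never adjacent.

    With three functional and two other dimensions, the functional ones
    must occupy positions 1, 3 and 5, and the others positions 2 and 4.
    """
    orders = []
    for f in permutations(FUNCTIONAL):
        for o in permutations(OTHER):
            orders.append((f[0], o[0], f[1], o[1], f[2]))
    return orders


if __name__ == "__main__":
    for order in interleaved_orders():
        print(" > ".join(order))
```

Under this constraint there are 3! × 2! = 12 admissible orders.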
Before the main response scaling task, a pilot study was performed to test study procedures and materials. Based on its results, some labels were eliminated from the initial pool to achieve a more manageable number for the response scaling task. In particular, any labels using additional modifiers such as ‘very’ or ‘quite’ were eliminated, as were any that were considered excessively colloquial or pitched at too high a level of language. After pilot testing, it was concluded that the feasible limit was about 10–12 labels per dimension for an individual respondent.
Responses to the scaling task were analyzed by calculating means and medians and the corresponding standard deviations and interquartile ranges (IQR). Labels to go forward for further testing were selected based on criteria that had been identified before data collection started. These included selecting labels close to or at the 25th, 50th, and 75th centiles on the VAS, ensuring consistency across dimensions, and ensuring coherence with wording in the descriptive system. No quantitative comparison of label scores was carried out in deciding which labels to carry forward to the next stage; median scores were simply used as a guide to determine which labels fell closest to the 25th, 50th, and 75th centiles. Labels were also required to be in colloquial language. The choice of labels and their appropriateness were discussed by the task force at several meetings during the course of the study.
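The descriptive analysis and the use of medians as a guide can be sketched as follows. This is a minimal illustration, not the analysis code used in the study: the function names are hypothetical, the scores are invented for the example, and the quartile method (`inclusive`) is an assumption, since the text does not say how the IQR was computed. The other selection criteria (consistency across dimensions, coherence with the descriptive system, task force discussion) are judgments that are not modeled here.

```python
from statistics import mean, median, quantiles, stdev


def summarize(scores):
    """Mean, median, sample SD and interquartile range of VAS scores."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")  # assumed method
    return {
        "mean": mean(scores),
        "median": median(scores),
        "sd": stdev(scores),
        "iqr": q3 - q1,
    }


def closest_labels(ratings, targets=(25, 50, 75)):
    """For each target point on the 0-100 VAS, return the label whose
    median score lies closest to it (first label wins on ties)."""
    medians = {label: median(scores) for label, scores in ratings.items()}
    return {
        target: min(medians, key=lambda lab: abs(medians[lab] - target))
        for target in targets
    }


if __name__ == "__main__":
    # Invented scores for illustration only; not data from the study.
    ratings = {
        "a little": [5, 10, 15, 20],
        "slight": [10, 20, 25, 30],
        "moderate": [40, 50, 55, 60],
        "severe": [70, 75, 80, 90],
    }
    for label, scores in ratings.items():
        print(label, summarize(scores))
    print(closest_labels(ratings))
```

The output of such a step is only a shortlist; as described above, the final choice also depended on wording and consistency criteria.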
Phase 2: testing the face and content validity of alternative 5-level versions
The results of the response scaling task produced two alternative 5-level versions, rather than one, in both UK English and Spanish (for an explanation, see Results). The second part of the study aimed to assess the ease of use, comprehension, interpretation, and acceptability of these two versions and to use these results to decide on a final, definitive version for validation work. A further aim of this part of the study was to evaluate the face validity of some hypothetical health states generated by the 5-level descriptive systems. For this purpose, the two alternative versions were tested in 8 focus groups in each country (16 groups in total); in each country, four groups were composed of healthy participants and four of participants under treatment for a health condition.
Groups were led by an experienced moderator, and sessions were audio-recorded and transcribed for analysis. A previously prepared script was followed in all groups. All participants in each group first completed either Alternative 1 or Alternative 2 of the EQ-5D-5L (depending on the group they were assigned to), followed by the EQ-VAS. Participants were then asked to review their answers and to describe what they had thought about while completing the questionnaire. Further questions were used to probe their reactions to the questionnaire in more detail, particularly their reactions to the severity labels used. Participants then provided socio-demographic information before being asked to complete the other alternative (Alternative 2 or Alternative 1), again on their own, after which there was further group discussion of their reactions. At the end, participants were asked which of the alternative descriptive systems they preferred. The order of administration of versions 1 and 2 was alternated between the groups to control for possible ordering effects, and groups were assigned randomly to the different orders.
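A counterbalanced random assignment of this kind can be sketched as follows. This is an assumed procedure for illustration only: the text states that the order was alternated and that groups were assigned to orders at random, but not the mechanism used. The function name, the group identifiers, and the seed are hypothetical.

```python
import random

# The two possible administration orders of the alternative versions.
ORDERS = (
    ("Alternative 1", "Alternative 2"),
    ("Alternative 2", "Alternative 1"),
)


def assign_orders(group_ids, seed=None):
    """Shuffle the groups, then alternate the two orders over the shuffled
    list, so that each order is used by half of the groups when the number
    of groups is even."""
    rng = random.Random(seed)
    shuffled = list(group_ids)
    rng.shuffle(shuffled)
    return {group: ORDERS[i % 2] for i, group in enumerate(shuffled)}


if __name__ == "__main__":
    # Eight focus groups per country, as in the study design.
    for group, order in sorted(assign_orders(range(1, 9), seed=2008).items()):
        print(f"Group {group}: {order[0]} first, then {order[1]}")
```

Shuffling before alternating keeps the allocation random while guaranteeing that the two orders are balanced across groups.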
In the final stage of the focus groups, participants discussed a set of hypothetical health states produced by combining different levels from the 5 dimensions using the alternative 5-level versions. Examples of the health states tested are shown in Table 1. Participants reviewed the states and were asked to assess them for face validity, interpretability, and plausibility. The same procedures were used in the remaining groups, though the order in which the alternative versions of the questionnaire were administered was reversed.
The focus groups were run using a structured ‘script’ or guide, so the analysis was based initially on grouping and contrasting participant statements relating to each of the specific issues addressed. Thematic content analysis was used to explore issues in more depth and to examine the transcripts for other, non-scripted statements and expressions.