EQ-5D-Y-5L: developing a revised EQ-5D-Y with increased response categories

Purpose EQ-5D-Y is a generic measure of health status for children and adolescents aged 8–15 years. Originally, it has three levels of severity in each dimension (3L). This study aimed to develop a descriptive system of EQ-5D-Y with an increased number of severity levels and to test comprehensibility and feasibility. Methods The study was conducted in Germany, Spain, Sweden and the UK. In Phase 1, a review of existing instruments and focus group interviews were carried out to create a pool of possible labels for a modified severity classification. Participants aged 8–15 rated the severity of the identified labels in individual sorting and response scaling interviews. In Phase 2, preliminary 4L and 5L versions were constructed for further testing in cognitive interviews with healthy participants aged 8–15 years and children receiving treatment for a health condition. Results In Phase 1, a total of 233 labels was generated, ranging from 37 (UK) to 79 labels (Germany). Out of these, 7 to 16 possible labels for each dimension in the different languages were rated in 255 sorting and response scaling interviews. Labels covered an appropriate range of severity on the health continuum in all countries. In Phase 2, the 5L version was generally preferred (by 68–88% of the participants per country) over the 4L version. Conclusions This multinational study has provided a version of the EQ-5D-Y with 5 severity levels in each dimension. This extended version (EQ-5D-Y-5L) requires testing its psychometric properties and its performance compared to that of the original EQ-5D-Y-3L. Electronic supplementary material The online version of this article (10.1007/s11136-019-02115-x) contains supplementary material, which is available to authorized users.


Introduction
Since 2010, the EQ-5D-Y has been available as a 'Youth' version of the EQ-5D, for children and adolescents aged 8-15 years. The instrument was developed using the standard three-level (3L) format of the EQ-5D descriptive system (adult version). Like the EQ-5D, the current EQ-5D-Y-3L descriptive system comprises five dimensions of health: 'mobility', 'looking after myself', 'doing usual activities', 'having pain or discomfort', and 'feeling worried, sad or unhappy'. Each dimension has three levels of severity, resulting in a total of 243 possible health states. Although the same dimension and response option structure as used in the EQ-5D-3L was retained, the wording and layout were modified to be suitable for children and adolescents [1][2][3]. The EQ-5D-Y-3L has demonstrated its feasibility in children and adolescents with different health conditions [4][5][6].
In 2011, the EQ-5D-5L, a five-level (5L) version of the EQ-5D for adults, was introduced, with the aim of reducing the instrument's ceiling effects and enhancing sensitivity, especially in milder health conditions [7]. Testing of the 5L adult version has shown that it works as well as or better than the 3L in various conditions and settings [8][9][10].
As with the 3L adult version, there is evidence of ceiling effects for the EQ-5D-Y-3L, and it has been criticized for being overly simplistic and potentially insensitive to small changes in health status [6,11,12]. In contrast to the EQ-5D-Y-3L, the majority of generic health-related quality of life (HRQoL) instruments for children and adolescents, such as the KINDL or PedsQL [13,14], use response options with more than three levels of severity. Expanding the number of severity levels in each dimension of the EQ-5D-Y might help to reduce ceiling effects and improve sensitivity.
The aim of the present study was to develop a descriptive system of the EQ-5D-Y with an increased number of severity levels and to test the comprehensibility and feasibility of the extended version. The number of levels in the final version was not defined a priori as a further aim of the study was to assess children's opinion and the acceptability of using versions with 4 or 5 levels of severity in each dimension.

Methods
This study was conducted in Germany, Spain, Sweden and the UK between May 2014 and June 2018. Ethical approval was obtained in each country. The study had two phases. In phase 1, potential severity labels were identified, sorting and response scaling interviews were conducted, and alternative 4L and 5L versions of EQ-5D-Y were developed for each country. In phase 2, both versions were tested for comprehensibility and feasibility, and children's opinions about the two versions were elicited. In both phases, a common standardized protocol was used to ensure that the same procedures were followed in each country.

Phase 1
Identifying a pool of labels for the severity levels
Procedure
A pool of possible labels for the extension of EQ-5D-Y-3L severity levels was developed from a review of HRQoL instruments and by focus group interviews. Existing generic as well as disease-specific HRQoL instruments for children in the four different languages were included with the aim of identifying labels which covered the full range of severity. Dictionaries and thesauruses were used to search for synonyms of previously identified labels. When deciding which labels to include in the final pool, the lexical structure of the EQ-5D-Y was taken into account. Only labels describing the 'quantity' or 'intensity' of health problems (e.g., a lot of, slight) were included; other terms, such as those relating to frequency, were excluded.
In addition, each country conducted two focus group interviews, one with children aged 8-10 and one with participants aged 11-15, as it was assumed that younger children would be more willing to participate and less shy in a separate group. Children without any obvious health problems were drawn from the general population through collaboration with local schools and sports clubs. To participate, the relevant local language had to be the main one spoken in the participant's home.
The aim of the focus group interviews was to identify child-friendly labels normally used by the target group. In general, focus group interviews have predominantly been conducted with adults [15]. However, there is evidence that these are feasible with younger participants for use in the development of child-specific HRQoL instruments to gather information about the wording and vocabulary of children and adolescents themselves [15,16].
Children and adolescents were first asked to talk about their own experiences with illness before being asked to describe pictures illustrating people with a health condition. This procedure aimed to identify words and phrases that children and adolescents used naturally and spontaneously when talking about health and illness. We were particularly interested in words young people used to describe the quantity or intensity of health problems. Subsequently, children were asked to rank the labels elicited from the earlier review of instruments from no problems to the most severe problems they could imagine. Lastly, the participants indicated the labels that they did not like or understand as well as those they liked the most.

Data analysis
The focus group discussions were documented by detailed notes or recorded and transcribed and then analyzed using thematic content analysis [17]. As is typical for this kind of analysis, we defined categories (e.g., 'labels mentioned by the participants themselves', 'information about the labels in the context of the ranking') and screened the participants' comments for statements referring to these categories. Based on the results from the review and the focus groups, each country identified a pool of potential labels for use with each dimension of the EQ-5D-Y.

Sorting and response scaling interviews
Procedure
Sorting and response scaling techniques were used in individual interviews with children and adolescents to determine the relative severity of each label identified in phase 1. Sorting tasks were used with younger children (8-10 years) while older children (11-15 years) completed response scaling tasks.
A convenience sample of children and adolescents aged 8-15 years from the general population of school children recruited in primary and secondary schools was used. Different types of schools were included to ensure the participation of children and adolescents with different educational levels and socio-economic backgrounds. A total of 60 participants in each country was expected, with 20 in each age group: 8-10 years, 11-12 years and 13-15 years. The sample size was somewhat larger than that used in developing the adult version EQ-5D-5L [7], as the surveyed population of children and adolescents was considered to be more heterogeneous in terms of age and verbal comprehension.
In the context of the development of HRQoL instruments, the response scaling method typically requires participants to assign a numeric rating to an item or label. This method has already been used in the development of other HRQoL instruments [18,19]. In this study, older respondents (11-15 years) were asked to rate the severity of each label separately on a visual analog scale (VAS) from 0 to 10 (for the detailed labeling of the anchors, see the legend of Table 4). For younger children (8-10 years), the different categories of severity were presented on a 'smiling face' scale. Smiley faces are often used for child-friendly measures [20,21]. We used a modified version of the faces scale from the UK Household Longitudinal Study [22]. For each label, the younger children were asked to choose one of five smileys (from 1 = smiley of very bad mood to 5 = smiley of very good mood). The anchors of the scale, i.e., smiley 1 and smiley 5, were labelled in the same way as the anchors of the VAS used for the older participants.
All children rated all labels separately for each dimension. Participants were asked to indicate labels that they found hard to understand or which they did not use in daily language. The order in which the dimensions and the labels were presented to the participants was randomized to avoid bias. A pilot test of the tasks was conducted before they were applied to the full sample.
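The randomization of presentation order described above could be sketched as follows. This is a minimal illustration, not the study's actual procedure: the function name `presentation_order`, the placeholder label strings, and the use of one seeded generator per participant are all assumptions made for the example.

```python
import random

# Dimension names are taken from the text; the label strings are placeholders.
DIMENSIONS = [
    "mobility",
    "looking after myself",
    "doing usual activities",
    "having pain or discomfort",
    "feeling worried, sad or unhappy",
]


def presentation_order(labels_by_dimension, rng):
    """Return (dimension, labels) pairs with both levels of order shuffled."""
    dimensions = list(labels_by_dimension)
    rng.shuffle(dimensions)  # random order of dimensions
    order = []
    for dimension in dimensions:
        labels = list(labels_by_dimension[dimension])
        rng.shuffle(labels)  # random order of labels within the dimension
        order.append((dimension, labels))
    return order


# Hypothetical candidate labels; the real pools differed by country and dimension.
example_labels = {d: [f"label {i}" for i in range(1, 8)] for d in DIMENSIONS}

# Assumption: seeding per participant makes each order reproducible for auditing.
order = presentation_order(example_labels, random.Random(1))
for dimension, labels in order:
    print(dimension, labels)
```

Copying the lists before shuffling leaves the original label pool untouched, so the same pool can be reused for every participant.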
Data analysis
Labels were first grouped into two categories ('unusual and unclear labels'; 'usual and easily understood labels') based on participants' comments. Mean (standard deviation), median, mode, minimum and maximum of the sorting and response scaling data were then computed for all labels using SPSS version 23. These analyses were done separately for younger and older participants as the scale for the younger participants ranged from 1 to 5 while it was 0 to 10 for the older participants.
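The descriptive statistics listed above were computed in SPSS. As a rough equivalent, the following sketch uses Python's standard `statistics` module and keeps the two age groups separate because their scales differ. The ratings are made-up numbers for a single hypothetical label, and the helper name `describe` is an assumption.

```python
import statistics


def describe(ratings):
    """Summary statistics of the kind reported for each label."""
    return {
        "mean": statistics.mean(ratings),
        "sd": statistics.stdev(ratings),  # sample standard deviation (n - 1)
        "median": statistics.median(ratings),
        "mode": statistics.mode(ratings),
        "min": min(ratings),
        "max": max(ratings),
    }


# Made-up ratings for one hypothetical label, kept separate by age group
# because the faces scale runs from 1 to 5 and the VAS from 0 to 10.
ratings_by_group = {
    "8-10 years (faces scale, 1-5)": [4, 4, 3, 4, 5],
    "11-15 years (VAS, 0-10)": [2, 3, 2, 2.5, 2],
}

summaries = {group: describe(r) for group, r in ratings_by_group.items()}
for group, summary in summaries.items():
    print(group, summary)
```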
Criteria for label selection
Labels were selected for further testing based primarily on their distribution along the severity continuum. As both 4L and 5L formats were being considered, two sets of criteria were specified as shown in Table 1.
Labels were considered appropriate for the extended versions if the following criteria were met (ordered from most to least important): (1) median and mode matched the previously defined values (Table 1) exactly or were close to them, (2) median and mode had the same value, (3) standard deviation was very small, as that would show similarity of interpretation among respondents. The labels for the upper ('unable to', 'extreme pain or discomfort', 'extremely worried, sad or unhappy') and lower ('no problems', 'no pain or discomfort', 'not worried, sad or unhappy') levels of severity were selected as used for the anchors in the sorting and response scaling tasks as these showed good comprehensibility and feasibility.

Table 1 (partial; only the following rows are recoverable)
4L, age 11-15: VAS from 0 to 10; 4 labels needed; defined values 0-3.4-6.8-10
5L, age 8-10: 5 smilies; 5 labels needed; defined values 1-2-3-4-5
5L, age 11-15: VAS from 0 to 10; 5 labels needed; defined values 0-2.5-5-7.5-10
If there were uncertainties about the final decision for a label and more than one label was appropriate for a severity level based on the quantitative results, the results of the qualitative data analysis were taken into account when making the final decision. At the end of this phase, draft 4L and 5L versions were available.
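One way to picture the ordered selection criteria is as a lexicographic sort over the candidate labels for a given severity level. The sketch below is only an illustration under stated assumptions: the text does not give a formula, so the distance measure, the `LabelStats` type, the `rank_candidates` function and all candidate values are hypothetical. The target of 2.5 is the defined value for level 2 of a 5L version on the 0 to 10 VAS.

```python
from typing import NamedTuple


class LabelStats(NamedTuple):
    label: str
    median: float
    mode: float
    sd: float


def rank_candidates(candidates, target):
    """Order candidates by the three criteria, most important first."""

    def key(c):
        # (1) closeness of median and mode to the defined value (assumed metric)
        distance = abs(c.median - target) + abs(c.mode - target)
        # (2) median equal to mode sorts first (False < True)
        # (3) smaller standard deviation breaks remaining ties
        return (distance, c.median != c.mode, c.sd)

    return sorted(candidates, key=key)


# Made-up statistics for three hypothetical candidate labels.
candidates = [
    LabelStats("label A", median=2.5, mode=2.0, sd=1.1),
    LabelStats("label B", median=2.5, mode=2.5, sd=1.4),
    LabelStats("label C", median=4.0, mode=4.0, sd=0.6),
]

ranked = rank_candidates(candidates, target=2.5)
print([c.label for c in ranked])  # ['label B', 'label A', 'label C']
```

In this toy example, label C has the smallest standard deviation but ranks last because closeness to the defined value takes priority. A ranking like this would only shortlist candidates; as described above, the qualitative findings informed the final choice.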

Cognitive interviews
Procedure Cognitive interviews were conducted to test the 4L and 5L draft versions for comprehensibility, feasibility and preferences between versions. In Germany, Spain and Sweden, healthy children and adolescents aged 8-15 years as well as those in treatment for a health condition participated in individual or group interviews. Healthy participants were recruited in collaboration with schools and participants with a health condition in collaboration with local hospitals. The interviews took place in a separate room assigned by the schools or hospitals. Participants with a health condition were included to get feedback from children who might use labels representing higher levels of severity.
According to the standardized protocol, participants first completed either the 4L or 5L to record their own health status, followed by a general discussion of the version. Participants then completed socio-demographic questions, before completing the other draft version and discussing that. Finally, they were asked which version they preferred and why. To avoid an ordering bias, the order of versions was varied. When discussing the versions, the paraphrasing method was used, whereby participants were asked to rephrase the items in their own words; probing was used to explore problems in answering, comprehension, and participants' reasons for choosing a given response option [23].
In the UK, a slightly different approach was taken. Two focus group interviews were conducted to test the provisional 5L version. Pupils from primary and secondary schools participated. Recruitment of children with current experience of illness would have necessitated obtaining separate ethical approval from the National Health Service (NHS). This would have incurred significant delay so that recruitment was limited to children attending schools. Participants were initially asked to record their current health status using the 5L version and then to review the first page and to circle any words or phrases that might be difficult to understand for other people of their age. These words were discussed in the group. In addition, the participants reported how hard or easy they found it to answer the version. Finally, each pupil was asked to complete a written task designed to test their comprehension of key words and phrases that were considered to be the most problematic.

Data analysis
The interviews and group discussions were recorded, transcribed and analyzed using thematic content analysis [17]. Comments made by participants were assigned to defined categories such as 'general comprehensibility and ease of use', 'comprehensibility of labels', and 'suggestions for changes'.

Harmonization
As we wanted to develop language-specific versions from scratch, we did not expect to find absolutely equivalent labels in all countries. However, the three 5L language versions (Swedish, Spanish, German) were translated into English and compared to each other and to the UK English version. Any discrepancies between versions were discussed in a harmonization exercise involving researchers from each country.

Results

Identifying a pool of possible labels for the severity levels
The review of HRQoL instruments and focus groups identified potentially usable labels (Germany: 79; Spain: 67; Sweden: 50; UK: 37) from which a smaller number of labels per dimension was selected for inclusion in the sorting and response scaling interviews (Table 2). During the screening of HRQoL instruments to select candidate labels, the UK team was quite strict about whether the labels were grammatically compatible with the general format of EQ-5D dimension statements and whether they seemed to be child-friendly. This led to a smaller initial label pool than in other countries. The German label pool was comparatively large because the German language is complex and offered several options of wording, and the German team wanted to give children and adolescents the chance to give their view on many different labels. Labels representing the full range of severity were included in all countries. The same set of labels was applied for the 'mobility', 'looking after myself' and 'usual activities' dimensions and a somewhat different set for the 'having pain or discomfort' and 'feeling worried, sad or unhappy' dimensions.

Sorting and response scaling interviews
Each country conducted between 59 and 72 sorting and response scaling interviews giving a total of 255 interviews.
Detailed information about the sample characteristics is shown in Table 3. Table 4 shows the range of median values for the 'mobility' dimension based on the responses of the participants aged 11-15 years, while Table 5 provides the same information for the 'feeling worried, sad or unhappy' dimension. These are provided as examples as the range of values for the other dimensions was similar. Labels covered an appropriate range of severity on the health continuum in all countries. The ratings from the participants aged 8-10 years were comparable to those from the older participants.
Median values, mode and standard deviation were considered in the decision regarding final labels for the 4L and 5L versions, as well as participants' verbal statements. For example, the Swedish labels 'pyttelite' (a tiny bit) and 'något' (some, somewhat) for dimensions 'mobility', 'looking after myself' and 'doing usual activities' were ranked differently among the two age groups, who also appeared to interpret the words in different ways. Hence, these words were not chosen as final labels. The importance of the verbal statements was seen in Germany, where the label 'ein bisschen' (somewhat/a bit) was chosen for level 2 in the 'feeling worried, sad or unhappy' dimension. Based on the values given for the labels, 'leicht' (slightly) and 'ein wenig' (a few/a bit) were also possible options. However, participants mentioned that they would not use these words in everyday language in the context of being worried, sad or unhappy, so it was decided not to use them for this dimension.

Cognitive interviews to test comprehension and feasibility of the extended 4L and 5L versions
Sample characteristics for participants in phase 2 are shown in Table 6. Participants' comments indicated that they found both versions, EQ-5D-Y-4L and EQ-5D-Y-5L, to be feasible to complete and easily understood: "…it wasn't hard to complete and there were no difficult words" [boy, 12 years, Sweden]. In Germany and Sweden, no questions were raised about the labels for the severity levels or the general questionnaires, among either the younger or older participants. In Spain, some of the labels which were chosen after the sorting and response scaling exercises caused problems for the participants. For example, in the 'mobility', 'looking after myself', and 'usual activities' dimensions, level 2 'un poco de problema' (a little bit of a problem) was changed to a more natural-sounding wording ('algún pequeño problema'). Some of the younger participants were also unsure how to interpret 'moderados' and 'moderadamente' (moderate) and, after discussion between the researchers and children, it was decided to use 'bastante' (quite a lot) as the alternative which was closest and easiest to understand. Other changes included replacing 'muchísimos' (used in the 'mobility', 'looking after myself', 'doing usual activities' and 'pain or discomfort' dimensions) with the more child-friendly terms 'muchos' or 'mucho' (a lot) and replacing 'algo' (somewhat) with 'un poco' (a little) in the 'worried, sad or unhappy' dimension.

Table 5 Median VAS for the labels rated for the 'feeling worried, sad or unhappy' dimension in the response scaling interviews by participants aged 11-15 years, by country
a VAS from 0 to 10 was used. Anchor '0' was labelled as 'not' and anchor '10' was labelled as 'extremely'
b VAS from 0 to 100 was used. Anchor '0' indicated the best status and anchor '100' indicated the worst status
c VAS from 0 to 100 was used. Anchor '0' was labelled as 'extremely' and anchor '100' was labelled as 'not'
When directly asked for their opinion, the majority of the participants in Germany, Spain and Sweden, irrespective of their health status, preferred the EQ-5D-Y-5L (Germany: 88%; Sweden: 66%; Spain: 68%) over the 4L version. They felt it allowed them to rate their health in more detail. They commented that the 5L version is more precise, and they liked the fact that it has a middle answer category. In Sweden, one respondent stated 'I thought the 5L version was best because there were more options to choose from' [boy, 13 years, Sweden]. In Germany, one participant argued: '[…] you are not able to state your current health status [in the 4L version] as precisely as in the 5L version' [girl, 10 years, Germany]. Participants with health problems also noted that the EQ-5D-Y-5L provided more options for reporting severe health problems. Some participants felt that answering the 4L version was more difficult than answering the 5L version as there were fewer possibilities to choose from. Two participants, however, commented critically on the central response option in the 5L as they thought it might be used by respondents who were unwilling to decide between answers. The central response category of the EQ-5D-Y-3L has nevertheless been used in previous studies without any evidence of this type of problem [6,11,12].
In the UK, only the EQ-5D-Y-5L was tested and no children reported difficulties in completing it. There were no questions from the participants while answering the questionnaire and no missing data. However, the discussion of the words 'terrible'/'terribly' to describe level 5 in the dimensions 'having pain or discomfort' and 'feeling worried, sad or unhappy' with primary school children indicated the need for further attention as they were especially problematic. When language is embedded in a hierarchical structure then it could be assumed that the 'correct' understanding of a word is implied through its association with adjacent response categories. The label for severity level 5 defines the upper bound and has no scope for such a compensating mechanism. The UK team therefore considered the word 'extreme'/'extremely' as a replacement for 'terrible'/'terribly' since it was used in the other language versions. This was investigated in further interviews with a small number of children (n = 4) who confirmed this substitution.
The final language-specific 5L versions can be seen in Table 7; the 4L versions that were included in the testing can be found in the online resource 1 (Table A1).

Harmonization
The comparison of the 5L versions occasionally showed divergent wordings for the labels. This was primarily because (1) it was difficult to find an exact translation for a term in English or (2) a specific label was chosen based on participants' comments and was therefore justified by the results of the field work. For example, the fourth level of the first three dimensions is 'a lot of' in English and, more or less, also in Swedish but 'große' (great) in German. The two labels are therefore not strictly equivalent, but the alternative German wording of 'viele' (a lot of) was more frequently cited by participants in the German cognitive debriefing exercise as being unclear and an unusual wording. The term 'große' was therefore preferred. This also means that the youth version is consistent with the wording used in the German 5L adult version. The discussion of all labels and the slight discrepancies in the different languages showed that the labels were comparable, i.e., labels remained as developed by the national teams.

Discussion
In a process of identifying appropriate labels for an extended version of the EQ-5D-Y and testing the comprehensibility and feasibility, this study was successful in establishing a 5L version of the EQ-5D-Y, the EQ-5D-Y-5L.
It is hoped that the development of this 5L version will lead to an improvement over the EQ-5D-Y-3L in terms of its performance in general and sensitivity in particular. Compared to the EQ-5D-Y-3L, which defines 243 health states, the EQ-5D-Y-5L defines a broader spectrum of 3125 possible health states. However, the 5L version will require further investigation in terms of testing its psychometric properties. The 5L format maintains comparability with the corresponding adult version, as do the adult and youth versions of EQ-5D-3L. It is anticipated that this will allow continuous measurement of health status over a lifetime and also permit comparison of results obtained using the two versions of the instrument [11]. This can be important when evaluating the impact of chronic disease which appears in childhood and lasts throughout adulthood [24].
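The health state counts quoted above follow from combining one severity level per dimension: with five dimensions, a version with k levels defines k^5 states. A short sketch that enumerates the states explicitly (the function name `health_states` is an assumption for illustration):

```python
from itertools import product

DIMENSIONS = 5  # the five dimensions of the descriptive system


def health_states(levels):
    """All combinations of one severity level (1..levels) per dimension."""
    return list(product(range(1, levels + 1), repeat=DIMENSIONS))


states_3l = health_states(3)
states_5l = health_states(5)

print(len(states_3l))  # 243, i.e. 3 ** 5
print(len(states_5l))  # 3125, i.e. 5 ** 5
print(states_5l[0], states_5l[-1])  # (1, 1, 1, 1, 1) (5, 5, 5, 5, 5)
```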
As recommended in guidelines on the development of patient-reported outcome (PRO) instruments, we made considerable efforts to take into account the views of the target group when developing the instrument [25][26][27]. Standardized procedures were co-designed across national research teams with children and adolescents being involved in the process at several stages to ensure development of an age-appropriate instrument by using their preferred wording and everyday language wherever possible. In phase 1, participants reviewed preexisting labels and suggested possible new labels for use in the new version of the questionnaire based on understandable language and everyday speech of children. We found that participants of all ages were able to rate the severity of different labels using a sorting or response scaling task. This study is, to the best of our knowledge, the first to demonstrate the feasibility of using response scaling tasks in participants as young as eight. Participants contributed actively in phase 2 interviews and freely expressed their opinions about the different wording options offered. Overall, the ability of young persons to participate in studies using scientific methods should not be underestimated. The recruitment of children and adolescents as study participants is a challenge; in this study, it was especially difficult to recruit those with a health condition. Ideally, for the integration of young people's perspective, it is necessary to involve them directly in research. Overall, the applied methods worked well in all countries, although the protocol adopted in the UK and in Spain deviated marginally from that employed elsewhere.

Table 7 (fragment), Swedish: 'Jag är extremt orolig, ledsen eller olycklig' (I am extremely worried, sad or unhappy)
Comparing the EQ-5D-Y-5L and EQ-5D-Y-3L labels shows that the structure was not always changed in a similar way in all countries. Some of the labels from EQ-5D-Y-3L remained, while others were replaced. As was found in the development of the adult EQ-5D-5L [28,29], it would have been overly simplistic to insert an additional level between the original levels 1 and 2 and between levels 2 and 3. Hence, it was important to examine different labels for use in the extended versions.
This work on severity labels of the EQ-5D-Y descriptive system is also important in the context of future development of national value sets where the labels have to be valued as part of health state profiles, i.e., without the respondent seeing all severity labels of one dimension and their complete rank order as in the whole descriptive system.
A limitation of our study is that we used convenience samples in all countries and within both study phases; hence, the study population is not representative of the national population in each country. However, by including all age groups, boys and girls, and participants from different types of schools, we tried to ensure the inclusion of children and adolescents with a broad spread of characteristics.
The present study has produced a self-report version of the EQ-5D-Y-5L, but future research will be needed to develop proxy versions of the instrument. In the future, it will be important to conduct validation studies for the different language versions of EQ-5D-Y-5L in different groups of children and adolescents, and especially among participants with different health conditions, to identify measurement properties of the instrument. The UK English version is assumed to be the source version for the translation of further language versions. It is also expected that research on valuation of the EQ-5D-Y-3L and EQ-5D-Y-5L will continue.

Conclusion
Children and adolescents in all participating countries contributed to the selection of candidate labels for an enhanced version of the EQ-5D-Y-3L and were able to rate the severity of different labels. They preferred the five-level version of EQ-5D-Y over the proposed four-level alternative. The new EQ-5D-Y-5L was comprehensible and feasible for children and adolescents in the age range 8-15 years and should provide a useful tool for those wishing to incorporate a short, simple, and easy-to-use measure of health status in their research. Before the EQ-5D-Y-5L is used more extensively, further research is required to test its psychometric performance and to investigate its feasibility for use in health state valuation exercises.