Psychometric properties of the Japanese version of the EQ-5D-Y by self-report and proxy-report: reliability and construct validity

Purpose This study aimed to assess psychometric properties of the Japanese version of the EQ-5D-Y (3 levels) with a focus on feasibility, reliability, and construct validity. Methods Respondents were recruited from the general populations of three cities in Japan. First, children and adolescents responded to the EQ-5D-Y and PedsQL by self-report. Parents were also asked to evaluate the health states of their children/adolescents using proxy versions of these questionnaires. Next, the EQ-5D-Y was mailed to their residence approximately 2 weeks later, and both children/adolescents and their parents responded to the questionnaire. Reliability was assessed by self-report test–retest methods and a comparison of self-report responses with proxy responses. Spearman’s correlation coefficients were calculated between responses to the EQ-5D-Y and both responses to and scores of the PedsQL in order to assess construct validity. Results A total of 654 children/adolescents aged 8 to 15 (median age: 11) responded to the questionnaires at both the first- and second-stage surveys. Test–retest agreement was sufficiently high and was influenced by age. Proxy test–retest results revealed that parents’ responses were more reliable than the self-report results. Some correlations (|r| > 0.3) between items of the EQ-5D-Y and PedsQL were found. Meanwhile, no correlations were found between proxy responses to the EQ-5D-Y and self-report responses to the PedsQL. Conclusions The EQ-5D-Y demonstrates reliability and validity among children/adolescents and their parents in Japan. Construct validity of the EQ-5D-Y by self-report was confirmed through comparisons with the PedsQL. Proxy responses to the EQ-5D-Y were more reliable than the self-report results, but construct validity was not confirmed in the proxy version.


Introduction
Measurement of health-related quality of life (HRQOL) of children and adolescents is becoming increasingly important for the evaluation of healthcare technologies. The EuroQol Five-Dimensional Questionnaire, Youth Version (EQ-5D-Y) [1], is designed to be a preference-based measure (PBM) that can be used to calculate quality-adjusted life years (QALYs). Some PBMs, including EQ-5D [2][3][4][5][6][7], Health Utilities Index (HUI) 2/3 [8][9][10][11], and Short Form 6 Dimension (SF-6D) [12][13][14][15], have been developed for adults, but only a few measures, including the HUI2 and Child Health Utility-9D [16,17], have been designed specifically for children/adolescents. Applying PBMs designed for adults to children can be problematic since the vocabulary may not be child-friendly and may thus be difficult to understand. In Japanese, both kanji (Chinese characters) and hiragana (Japanese characters) are used in sentences, and the Japanese version of the EQ-5D-5L uses both. However, since most young children cannot read or understand all kanji characters used in the questionnaire, developing PBMs tailored to children/adolescents is necessary in order to accurately evaluate their health states by self-completion (self-report).
Confirming the psychometric properties of HRQOL measures for younger people is particularly important, since they may lack the ability to fully comprehend the language and concepts used. This ability is influenced not only by age, but also by social environment and language characteristics. As discussed above, the Japanese language uses a combination of two character systems. While children learn hiragana in the first grade, the numerous kanji are learned gradually over the course of their school careers. Although the Japanese version of the EQ-5D-Y limits the use of kanji to those learned during the first 2 years of elementary school (ages 6-8), the impact of the dual-character language system needs to be confirmed. Moreover, no studies to date have compared self-report and proxy-report for the EQ-5D-Y to assess which of the two is more appropriate for properly capturing the health states of children/adolescents.
The EQ-5D for adults is used internationally as the de facto standard of PBMs [23]. The EQ-5D-Y uses a classification system similar to the EQ-5D, although the wording is tailored to children/adolescents. This study aimed to assess the psychometric properties of the Japanese version of the EQ-5D-Y within the context of the general Japanese population of children/adolescents. We expect that the EQ-5D-Y will be widely applied to the economic evaluation of healthcare technologies for younger people, as is the case for the EQ-5D among adults.

Instruments
The EQ-5D-Y comprises five items with three levels (no, some, and a lot): "mobility" (Item 1), "looking after myself" (Item 2), "doing usual activities" (Item 3), "having pain or discomfort" (Item 4), and "feeling worried, sad or unhappy" (Item 5). The words and phrases were modified to be more child-friendly while maintaining the domains of the adult version. The Japanese version of the EQ-5D-Y was prepared by a Japanese research group, which included the present authors, based on the first draft provided by the EuroQol group. The EuroQol group completed the process of translation, back translation, and harmonization, independently of the Japanese group. A linguistic pilot study with a small sample was performed by the Japanese group during the development process. Since the EQ-5D-Y targets children/adolescents aged ≥ 8, the Japanese version limits the use of kanji characters to those learned during the first two grades of elementary school (ages 6-8). In addition, each kanji character was provided with furigana (reading aid) using hiragana, to help children who have difficulty reading kanji. If predetermined value sets that reflect societal preferences of the general population are available, responses can be converted to an EQ-5D-Y index value. However, no value set for the EQ-5D-Y exists in any country yet, including Japan. Therefore, in the present survey, responses to the EQ-5D-Y were treated as ordinal variables.
The PedsQL Measurement Model is a generic profile-type measure of HRQOL for children/adolescents. The PedsQL Generic Core Scale consists of 23 items with five levels (never, almost never, sometimes, often, almost always) in four domains (physical functioning, 8 items; emotional functioning, 5 items; social functioning, 5 items; and school functioning, 5 items). Scores of each domain and total scores can be calculated based on responses to the questionnaire. The PedsQL has multiple versions that target different age groups. In the present study, two versions of the PedsQL were used for self-report, i.e., one for children aged 8-12 and the other for adolescents aged 13-15. A proxy version of the PedsQL has also been developed [24], and similarly has different versions based on age. Parents evaluated the health states of their children/adolescents using the appropriate proxy versions.

Data collection
The survey was conducted in three major cities in Japan (Tokyo, Osaka, and Fukuoka) from February to March 2018. Different dialects are spoken in the three cities, but the Japanese version of the EQ-5D-Y with 3 levels uses the standard Japanese language (Tokyo dialect). Accordingly, in order to consider the influence of dialects, children/adolescents from two cities outside Tokyo were included. We targeted the general population of children/adolescents aged 8-15, which corresponds to the age group targeted by the original EQ-5D-Y.
Respondents were recruited by a research company (ANTERIO Inc.), which sampled more than 600 respondents in the three cities (i.e., roughly 200 respondents at each location) by non-random sampling. The sample size was not based on any rigid statistical considerations. Children/adolescents were stratified by sex and age. For the first-stage survey, after obtaining informed consent, parents (the father or mother) and their children/adolescents were asked to visit a specific location to answer the questionnaires, which was intended to strengthen their commitment to taking part in the survey. Children/adolescents responded to the EQ-5D-Y and PedsQL (using versions appropriate for their age) by self-report in a different room from that of their parents. Parents, in turn, were asked to evaluate their child's health states using proxy versions of the EQ-5D-Y and PedsQL (using the proxy version appropriate for the age of the child/adolescent), as well as to provide demographic information. We estimate that the surveys were completed within 30 min. For the second-stage survey, only the EQ-5D-Y was mailed to their residence after approximately 2 weeks, and both children/adolescents and their parents responded. We asked the same parents who responded to the first survey to cooperate in the second-stage survey. After completion, the response sheets were sent back to the authors for analysis.
This study was approved by the ethics committee of the National Institute of Public Health, to which the corresponding author belongs (NIPH-IBRA #12179).

Reliability and construct validity
We principally followed the consensus-based standards for the selection of health measurement instruments (COSMIN) taxonomy for testing reliability and construct validity [25,26]. Reliability was assessed by self-report test-retest methods and by comparison of self-report responses with proxy-report responses by parents. As described above, the retest was performed approximately 2 weeks after the first-stage survey. We chose this interval because the health states of children from the general population are unlikely to change between the two time points; however, we cannot rule out the possibility that the health states of some children/adolescents changed.
Regarding construct validity (also referred to as convergent validity), we compared responses to the EQ-5D-Y with responses to and scores of the PedsQL. The PedsQL is one of the most broadly used HRQOL measures for children/adolescents in Japan. In some studies that measured the construct validity of the EQ-5D-Y, the PedsQL was used [27,28] together with other measures, such as KIDSCREEN. However, we used only one HRQOL measure in consideration of the feasibility of this survey.

Statistical analysis
Summary statistics of background factors were calculated. The feasibility of the EQ-5D-Y was investigated by calculating the percentage of missing values. We calculated the percentage of worst-level responses because it is expected that few of the children/adolescents sampled from the general population actually have the worst-level state. Reliability was evaluated by calculating the percentage of agreement and kappa coefficients between (a) self-report test and retest responses, (b) self-report and proxy-report in the first-stage survey, and (c) proxy test and retest responses.
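As a rough illustration of the agreement statistics named above, the following Python sketch computes the percentage of agreement, an unweighted Cohen's kappa, and a prevalence-adjusted bias-adjusted kappa (PABAK, the form reported in the Results) for paired three-level responses. The function names and the ten example response pairs are hypothetical and are not survey data. The PABAK formula, (k * p_o - 1) / (k - 1) for k response levels, is a common generalization and is assumed here rather than confirmed as the exact variant used in this study.

```python
from collections import Counter


def percent_agreement(first, second):
    """Proportion of respondents giving the same level on both occasions."""
    if len(first) != len(second) or len(first) == 0:
        raise ValueError("expected two non-empty sequences of equal length")
    return sum(a == b for a, b in zip(first, second)) / len(first)


def cohen_kappa(first, second):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(first)
    p_o = percent_agreement(first, second)
    counts_first, counts_second = Counter(first), Counter(second)
    # Chance agreement derived from the marginal distributions of both ratings.
    p_e = sum(counts_first[level] * counts_second[level]
              for level in counts_first) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


def pabak(first, second, n_levels=3):
    """PABAK for n_levels categories (assumed form): (k * p_o - 1) / (k - 1)."""
    p_o = percent_agreement(first, second)
    return (n_levels * p_o - 1) / (n_levels - 1)


# Hypothetical test and retest responses (1 = no, 2 = some, 3 = a lot).
test_responses = [1, 1, 1, 1, 1, 1, 1, 1, 2, 2]
retest_responses = [1, 1, 1, 1, 1, 1, 1, 2, 2, 1]

print(percent_agreement(test_responses, retest_responses))
print(cohen_kappa(test_responses, retest_responses))
print(pabak(test_responses, retest_responses))
```

In this made-up example, most responses fall in the first level, so Cohen's kappa is much lower than the raw percentage of agreement, whereas PABAK depends only on the observed agreement. This is one reason a prevalence-adjusted coefficient can be preferable when responses from a general population cluster at "no problems".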
Hypothesis testing for construct validity was investigated by Spearman's rank correlation coefficients between responses to the EQ-5D-Y and PedsQL. We considered a correlation to be present when the absolute value of the correlation coefficient was > 0.3 (|r| > 0.3) [31]. Our hypothesis was that "mobility" of the EQ-5D-Y is correlated with the physical score of the PedsQL, and that "having pain or discomfort" and "feeling worried, sad, or unhappy" of the EQ-5D-Y correlate with the emotional score of the PedsQL. We expected high correlation coefficients between the following items due to their similarities: (1) "mobility" of the EQ-5D-Y and the first ("It is hard for me to walk more than one block") and second ("It is hard for me to run") items in the physical domain of the PedsQL, (2) "looking after myself" of the EQ-5D-Y and the fifth item ("It is hard for me to take a bath or shower by myself") of the physical domain of the PedsQL, (3) "doing usual activities" of the EQ-5D-Y and the first item ("I have trouble getting along with other kids") of the social domain of the PedsQL, (4) "having pain or discomfort" of the EQ-5D-Y and the seventh item ("I hurt or ache") of the physical domain of the PedsQL, and (5) "feeling worried, sad, or unhappy" of the EQ-5D-Y and the second ("I feel sad or blue") and fifth ("I worry about what will happen to me") items of the emotional domain of the PedsQL.
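The correlation criterion described above can be sketched as follows. This is a minimal pure-Python illustration of Spearman's rank correlation with average ranks for ties, followed by the |r| > 0.3 check. The function names, the coding of EQ-5D-Y levels as 1 to 3 (higher meaning more problems), and the six example respondents are assumptions made for illustration only. Because higher PedsQL scores indicate better HRQOL, a negative coefficient is expected under this coding, which is why the absolute value is compared with the threshold.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def is_correlated(r, threshold=0.3):
    """Criterion used in the text: |r| > 0.3."""
    return abs(r) > threshold


# Hypothetical data: EQ-5D-Y "mobility" level and PedsQL physical score.
mobility_level = [1, 1, 1, 2, 2, 3]
physical_score = [100, 90, 95, 70, 80, 40]

r = spearman(mobility_level, physical_score)
print(round(r, 3), is_correlated(r))
```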

Results
A total of 654 children/adolescents responded to the questionnaires at the first-stage survey in three cities (219 each in Tokyo and Osaka, and 216 in Fukuoka). All participants (both children/adolescents and their parents) sent back their responses to the second-stage survey. Participant age and sex were well balanced. Participant characteristics are summarized in Table 1.

Feasibility
All participants (both children/adolescents and their parents) responded to all items of the EQ-5D-Y for the first- and second-stage surveys. Similarly, all participants completed the VAS. There was one missing (self-report) response for the PedsQL. Four of 654 (0.6%) children/adolescents responded to Item 1 ("mobility") with the worst-level response. Similarly, the numbers of children/adolescents who responded with worst-level responses were 0 (0%) for Item 2; 3 (0.5%) for Item 3; 27 (4.1%) for Item 4; and 22 (3.4%) for Item 5. On the other hand, the numbers of children who chose best-level responses were 495 (75.7%) for Item 1; 630 (96.3%) for Item 2; 593 (90.7%) for Item 3; 391 (59.8%) for Item 4; and 370 (56.6%) for Item 5. Only one child/adolescent had a VAS score of 0, and 23 (3.5%) had VAS scores < 50 points. In contrast, only two parents (0.3%) regarded their children's health state as being < 50 points by VAS. The 25th percentile of VAS score was 74 (self-report) and 80 (proxy-report) points.
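The percentages above follow directly from the reported counts and the sample size of 654. The short sketch below recomputes them; the variable and function names are hypothetical. The count for the middle level ("some problems") is not reported in the text and is derived here under the assumption, consistent with the statement that no EQ-5D-Y items were missing, that every respondent chose exactly one of the three levels.

```python
N_RESPONDENTS = 654

# Self-report counts per EQ-5D-Y item, as reported in the text.
worst_counts = {1: 4, 2: 0, 3: 3, 4: 27, 5: 22}
best_counts = {1: 495, 2: 630, 3: 593, 4: 391, 5: 370}


def percentage(count, total=N_RESPONDENTS):
    """Percentage of respondents, rounded to one decimal place."""
    return round(100 * count / total, 1)


def middle_count(item, total=N_RESPONDENTS):
    """Derived count for the middle level, assuming no missing responses."""
    return total - best_counts[item] - worst_counts[item]


for item in sorted(best_counts):
    print(item,
          percentage(worst_counts[item]),
          percentage(best_counts[item]),
          middle_count(item))
```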

Reliability
The mean period between responses to the first- and second-stage surveys was 16.9 days (SD: 2.2 days). The Chi-square test was performed to compare the distribution of EQ-5D-Y responses between self-report and proxy-report. This revealed significantly better evaluations by proxy-report compared to self-report. The percentages of agreement and kappa coefficients (PABAK) of the EQ-5D-Y are provided in Table 2. The highest percentages of agreement and kappa coefficients were observed between proxy-report at the first- and second-stage surveys, whereas a comparison of self-report and proxy-report showed the lowest agreement. Items related to mental state ("having pain or discomfort" and "feeling worried, sad or unhappy") had lower kappa coefficients in all three comparisons [(a) self- and self-report, (b) self- and proxy-report, and (c) proxy- and proxy-report], compared with the other three more physical items. In particular, kappa coefficients of mental items between self-report and proxy-report were less than 0.2, suggesting poor agreement. Kappa coefficients in all three comparisons for "mobility," "looking after myself," and "doing usual activities" were > 0.5 (fair to perfect agreement). "Looking after myself" had the highest kappa coefficient, at 0.9 (perfect agreement). Figure 1 shows the percentage of agreement by age based on binary data ("no problem" and "any problems"). For some items, agreement in comparisons (a) and (b) tended to be higher among older children/adolescents. Logistic regression was used to examine the relationship between background factors and agreement of self-report responses at the two time points (Table 3), agreement of self- and proxy-report responses (Table 4), and agreement of proxy-report responses at the two time points (Table 5).
Most demographic factors were not significantly related to agreement, but agreement for some items was higher in the older group (junior high school students), except in Table 5. The relation seems stronger in the comparison of self-report and proxy-report. Percentage of agreement by subgroup is also shown in Tables 6, 7 [...] (N = 654, SD: 12.1) by proxy-report in the second-stage survey. VAS scores by proxy-report were significantly higher than scores by self-report (paired t test; P < 0.0001). Moreover, variance in VAS scores by proxy-report was smaller than that by self-report (F test; P < 0.0001). ICCs of VAS scores were as follows: 0.40 between self-report at the first- and second-stage surveys, 0.06 between self-report and proxy-report at the first-stage survey, and 0.31 between proxy-report at the first- and second-stage surveys. Agreement of VAS scores between self-report and proxy-report tended to be poor.
Table 9 shows the correlation matrix between EQ-5D-Y responses and PedsQL scores by self-report in the first-stage survey. Item 1 of the EQ-5D-Y was correlated with the physical score of the PedsQL, but with none of the mental scores. Similarly, Item 5 of the EQ-5D-Y was correlated with the emotional score of the PedsQL, but not the physical score. Item 4 of the EQ-5D-Y had correlation coefficients > 0.3 with physical, emotional, and social scores of the PedsQL. Table 10 shows the correlation matrix between self-report responses to the EQ-5D-Y and PedsQL. Consistent with the hypotheses described in the Statistical analysis section, all correlation coefficients were > 0.3 (|r| > 0.3), except for (3) "doing usual activities" and the first item of the social domain of the PedsQL. No correlation was observed between proxy-report for the EQ-5D-Y and self-report for the PedsQL (Table 11). None of the coefficients exceeded our criterion. Proxy-report for the EQ-5D-Y had a lower construct validity than self-report for the EQ-5D-Y.
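To make the ICCs of VAS scores reported earlier in this section more concrete, the sketch below computes an intraclass correlation coefficient for paired scores. The text does not state which ICC form was used, so a one-way random-effects, single-measurement ICC, often written ICC(1,1), is assumed here purely for illustration. The function name and the five example score pairs are hypothetical and are not survey data.

```python
def icc_oneway(pairs):
    """One-way random-effects ICC(1,1) for n subjects with k ratings each.

    ICC = (MS_between - MS_within) / (MS_between + (k - 1) * MS_within)
    """
    n = len(pairs)
    k = len(pairs[0]) if n else 0
    if n < 2 or k < 2 or any(len(p) != k for p in pairs):
        raise ValueError("need at least 2 subjects, each with the same k >= 2 ratings")
    grand_mean = sum(sum(p) for p in pairs) / (n * k)
    subject_means = [sum(p) / k for p in pairs]
    # Between-subject mean square: variation of subject means around the grand mean.
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    # Within-subject mean square: variation of ratings around each subject's mean.
    ms_within = sum((x - m) ** 2
                    for p, m in zip(pairs, subject_means)
                    for x in p) / (n * (k - 1))
    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return (ms_between - ms_within) / denominator


# Hypothetical VAS scores: (first-stage survey, second-stage survey).
vas_pairs = [(80, 85), (90, 90), (70, 75), (95, 90), (60, 65)]
print(round(icc_oneway(vas_pairs), 3))
```

Under this form, an ICC near 1 means that most of the variation in VAS scores lies between respondents rather than between the two measurements of the same respondent; a two-way model would additionally separate systematic differences between the two occasions or raters.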

Discussion
In this study, we examined the psychometric properties of the Japanese version of the EQ-5D-Y. Our results suggest that the EQ-5D-Y by self-report was feasible for Japanese children/adolescents aged 8-15. In terms of reliability, test-retest agreement was sufficiently high. For some items of the EQ-5D-Y, reliability for junior high school students was higher than that for elementary school students. This relation seems stronger in the comparison of self-report and proxy-report. Parents may understand the health states of their children/adolescents better as the children grow older.
Reliability based on the kappa coefficients of self-report and proxy-report differed between the physical items and the mental items. The proxy-report test-retest reliability was higher than the self-report test-retest reliability. Younger children's feelings generally change more easily. In some cases, younger children's responses to an item were influenced by non-health-related events. For example, in our interview experience, one young child said that he/she was sad because his/her mother had scolded him/her. Another child told us that it was difficult for him/her to walk about because he/she was tired from coming to the survey location. Given this lower reliability, the self-report EQ-5D-Y may need to be used more deliberately, and its results interpreted more carefully, than the EQ-5D for adults.
Construct validity of the EQ-5D-Y by self-report was confirmed by comparisons with the PedsQL. No correlations were observed between responses to the EQ-5D-Y by proxy-report and PedsQL by self-report. Overall, the result suggests that proxy responses may not sufficiently capture the health states of children/adolescents.
Some studies have reported on the psychometric properties of the EQ-5D-3L and -5L for adults in the general population [32][33][34][35][36][37][38]. However, reports on properties of the EQ-5D-Y in the general population are limited [27,28,39]. Because psychometric validation of the EQ-5D-Y in the general population is difficult to implement, previous EQ-5D-Y validity studies have reported on small samples from disease-specific populations, which underlines the novelty of our study. In a study that assessed the reliability of the self-reported EQ-5D-Y at two time points (7-10 days interval) [27], percentages of agreement (two categories of "no problem" and "any problems") were 91.5% (Item 1), 93.8% (Item 2), 82.9% (Item 3), 69.8% (Item 4), and 78.3% (Item 5) in Italy, and 99.4%, 99.7%, 97.5%, 86.2%, and 87.4%, respectively, in Spain.
Our present findings suggest that agreement in Japan is lower than that in Italy and Spain, although a common feature was the low agreement for items related to emotion. The ICCs of EQ-5D-Y VAS were 0.82 in Italy and 0.83 in Spain. These ICCs are higher than the ICC in [...] and 76.9%, respectively, in Spain. This may be one reason for the lower percentage of agreement in Japan. The ceiling effect of the EQ-5D-3L is well known and the 5L version was developed to address this issue [40,41]. In some instances, children/adolescents who feel that "no problem" is not entirely accurate, but feel better than the description of the next level, might rate their health state as being the first level (i.e., "no problem"). In other instances, they might instead rate their health state as the next level.
Fig. 1 a Percentage of agreement (three categories) between self-report at first- and second-stage surveys. b Percentage of agreement (three categories) between self-report and proxy-report at first-stage survey
One strength of our survey was that we achieved a 100% collection rate in the second survey. Only one response had a missing value for the PedsQL, and there were no missing values for the EQ-5D-Y in either survey. This may have reduced any bias caused by missing values. However, there are also some potential limitations worth noting. First, the survey environment differed between the first- and second-stage surveys. Specifically, the first-stage survey was performed in a meeting room that respondents visited, whereas the second-stage survey took the form of a mail survey. We cannot rule out the possibility that the change in environment influenced the responses in some way. Second, we targeted children/adolescents from the general population, who are likely to be relatively healthy. Further research will be needed to confirm the psychometric properties of the EQ-5D-Y for children/adolescents in clinical settings. Third, an external anchor was not used for selecting children/adolescents with stable health conditions over the 2 weeks for test-retest reliability assessment. The effect of changes in health conditions may therefore be included in the test-retest reliability assessment. We also did not obtain information about illnesses among the children/adolescents. Finally, we used only one HRQOL measure to assess construct validity. Specifically, we used only the PedsQL, since it is one of the most widely used HRQOL measures in Japan. We adopted this approach after considering the burden of responding to multiple questionnaires, but additional comparisons with other measures, such as KIDSCREEN, may help further confirm the construct validity of the EQ-5D-Y.
In conclusion, our results demonstrate that the EQ-5D-Y is a feasible measure for use with children/adolescents in Japan, with sufficient reliability and validity. Proxy-report of the EQ-5D-Y was more reliable than self-report, but its construct validity was not confirmed.
Funding This study was partly funded by JSPS KAKENHI (Grant Number JP16K08898).

Compliance with ethical standards
Conflict of interest All authors declare no conflict of interest.
Informed consent Informed consent was obtained from all individual participants included in the study.
Research involving human participants and/or animals All procedures performed in studies involving human participants were in accordance with Ethical Guidelines for Clinical Research of the Japanese Ministry of Health, Labour and Welfare and with the 1964 Helsinki Declaration and its later amendments.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.