Key Points for Decision Makers

The application of child-specific preference-based measures enables the calculation of utilities for cost utility analysis of health technologies targeted for paediatric populations.

Proxy reports (e.g., parent/guardian or a health professional), used in lieu of child self-reports in circumstances when self-reports are not feasible, can often diverge from the child’s assessment of their own HRQoL.

This review examined the agreement between the child self- and proxy-reported overall and domain-level HRQoL using generic preference-based measures.

In general, the inter-rater agreement was poor for overall utilities across the measure/s applied and/or the context of the application. In addition, the agreement between children and proxy respondents within the domains of the respective measures was lower for psychosocial-related attributes compared with physical attributes.

1 Introduction

Evidence from economic evaluation is increasingly being utilised by regulatory bodies such as the Pharmaceutical Benefits Advisory Committee (PBAC) in Australia and the National Institute for Health and Care Excellence (NICE) in parts of the UK to evaluate the cost effectiveness of health technologies targeted for paediatric populations [1]. PBAC, for example, considers evidence derived from measures of health-related quality of life (HRQoL) when recommending medicines eligible for government subsidies under the Pharmaceutical Benefits Scheme (PBS) [2]. Economic evaluations involving cost-utility analysis (CUA) have become the most prevalent approach for providing health economic evidence to assess the cost effectiveness of new health technologies for adult and paediatric populations. Within CUA, outcomes are most typically presented as quality-adjusted life-years (QALYs). The QALY combines ‘utility’ indexed on a 0–1 scale (where 0 is equivalent to being dead and 1 is equivalent to full health) and length of life into a single generic measure of health outcome, thereby facilitating comparisons of the health gains generated from alternative interventions [3, 4].
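As a brief illustration of how the QALY combines utility and length of life, the following minimal sketch sums utility multiplied by duration over consecutive health-state periods. The function name `qalys` and the example profile are hypothetical, and no discounting is applied; this is an arithmetic illustration only, not a method used in this review.

```python
def qalys(periods):
    """Sum utility x duration over consecutive health-state periods.

    periods: iterable of (utility, years) pairs, with utility on the 0-1
    scale described above (0 = dead, 1 = full health).
    Assumption: no discounting of future years.
    """
    total = 0.0
    for utility, years in periods:
        if not 0.0 <= utility <= 1.0:
            raise ValueError("utility is assumed to lie on the 0-1 scale")
        if years < 0:
            raise ValueError("duration must be non-negative")
        total += utility * years
    return total


# Hypothetical profile: 2 years at utility 0.8, then 1.5 years at utility 0.6.
example_qalys = qalys([(0.8, 2.0), (0.6, 1.5)])  # approximately 2.5 QALYs
```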

The application of child-specific preference-based measures enables the derivation of utilities (preference weights) for incorporating into CUA of health technologies targeted for paediatric populations [5]. In a previous review of validated measures, Chen and Ratcliffe identified nine generic preference-based measures that have been applied to measure and value HRQoL in children and adolescents: Quality of Well-Being Scale (QWB), Health Utilities Index Mark 2 (HUI2), Health Utilities Index Mark 3 (HUI3), Sixteen-dimensional measure of HRQoL (16D), Seventeen-dimensional measure of HRQoL (17D), Assessment of Quality of Life 6-Dimension (AQoL-6D) Adolescent, Child Health Utility 9 Dimensions (CHU9D), EQ-5D Youth version (EQ-5D-Y) and Adolescent Health Utility Measure (AHUM). Preference-based measures comprise two main components: a descriptive system for measuring HRQoL, and a preference-based scoring algorithm for generating utilities. The descriptive systems of these nine measures differ in the content, type and absolute number of HRQoL dimensions (domains/attributes) and/or response levels included. Similarly, the preference-weighted scoring algorithms (value sets) for these measures differ according to the methods used to generate the value set, e.g., time trade-off (TTO), standard gamble (SG) or discrete choice experiments (DCEs), and the population from whom the value set was derived, e.g., adults or young people [3].

Ideally, the individual themselves should be the principal source of information about their own HRQoL [1]; however, self-assessment of HRQoL is challenging in the paediatric population. According to the Professional Society for Health Economics and Outcomes Research (ISPOR) Good Research Practices Patient-Reported Outcomes (PRO) Task Force Report, there is insufficient evidence to determine whether self-reporting of HRQoL by children under 8 years of age is reliable or valid [6]. Furthermore, older children with conditions associated with neurodevelopmental delays may be unable to self-assess their own HRQoL due to limited cognitive abilities. Such circumstances may require relying on an adult proxy such as a parent/guardian or a health professional to assess the child’s HRQoL [7].

It is well documented that proxy assessments of HRQoL in any population group tend to differ from self-assessments, with proxy assessors typically reporting lower HRQoL than the person themselves [1, 6, 8, 9]. Two previous systematic reviews by Khadka et al. and Jiang et al. of child self- and proxy-reported child utilities found that utilities tended to differ, with proxies often underestimating the child’s HRQoL [10, 11]. In child populations, there is some evidence to indicate that proxy assessment of the child’s HRQoL may be influenced by external factors, e.g., a mother’s assessment of the child’s HRQoL may be influenced by her own HRQoL [12].

In their systematic review, Jiang et al. examined the difference in self- and proxy-reported utilities [11]. Child HRQoL ratings obtained by two different observers, the child self and the proxy, are likely to differ owing to the differences in their perspectives. Therefore, it is also important to determine the extent to which the two raters agree or assign the same rating for an item being measured, i.e., to report inter-rater agreement measures that estimate the strength of agreement between raters [13, 14]. This systematic review sought to add to the existing evidence by focusing on reported measures of agreement in child and proxy assessments of paediatric HRQoL using established generic preference-based measures, highlighting individual domain-level differences in agreement, in addition to overall utilities. This study also presents the methods and findings from a meta-analysis of reported agreement statistics to provide an overall indication of the extent of agreement in child self and proxy assessments of paediatric HRQoL according to the available evidence.

2 Methods

2.1 Search Strategy

The literature search strategy was adapted from a previous study undertaken by Khadka et al., and the search keywords were reproduced [10]. The time frame covered by the previous search was from inception to 30 July 2017. To reflect the latest publications during the 4-year period since the initial search undertaken by Khadka and colleagues, this review incorporated peer-reviewed articles published in electronic journals between 30 June 2017 and 19 May 2021. The online databases searched included PubMed, The Cochrane Library, Web of Science, EconLit, Embase, PsycINFO and CINAHL (via EBSCOhost). Key words such as ‘utility’, ‘quality-adjusted life years’, ‘children’, ‘adolescents’, and ‘preference-based measure of HRQoL’ as well as related Medical Subject Headings (MeSH) terms were used for the systematic literature search. A detailed account of the search terms and the strategy is presented in Appendix 1 (see electronic supplementary material [ESM]). The identified studies were screened using the web-based systematic review software Covidence [15]. This review is registered with the International Prospective Register of Systematic Reviews (PROSPERO; registration number CRD42021256815). The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement guidelines were used for reporting this review (Appendix 2, see ESM) [16].

2.2 Inclusion and Exclusion Criteria

All studies published in English with full-text availability were included. Eligible studies included primary studies applying generic preference-based measures to derive health utilities amenable to QALY calculations in a paediatric population as assessed by child and proxy dyads (hereafter, child or children refers to all school-age children and adolescents, i.e., between 5 and 18 years of age, unless stated otherwise). Inclusion criteria were studies reporting the agreement level for overall and/or domain-level paediatric HRQoL by both children and the proxies reporting on behalf of the children. Those studies that reported the paediatric health state utilities as assessed by child (self) and proxy respondents but did not include the agreement statistics were excluded. Additionally, as this systematic review focused on studies applying generic preference-based measures to derive health utilities, primary studies conducted among the paediatric populations were excluded if the utilities were obtained (1) directly using SG, TTO and a visual analogue scale (VAS), or (2) indirectly using condition-specific (as opposed to generic) HRQoL measures.

2.3 Article Screening

Article screening was carried out in three steps. In the first step, two independent reviewers (DK and KL) screened the titles and abstracts based on the inclusion and exclusion criteria. Records with conflicting decisions were deferred to a third reviewer to reach a consensus. Articles selected at the screening stage were then included for a full-text review in the second step. The same two reviewers reviewed all the articles included in this stage. Simultaneously, two other reviewers (JK and CMK) independently assessed 10% of the articles in total to confirm the decisions of the former pair of reviewers. Following a discussion with the initial reviewing pair and the other reviewers (JR, JK, CMK) to reach a consensus, full-text articles that met the criteria were included. In the final step of this process, all eligible articles were subsequently consolidated and information relevant to the study was extracted.

2.4 Data Extraction

Data extraction was performed by the first author (DK). Each article was assessed to retrieve the following information: bibliographic details, geographic setting, study design, health state experienced, the generic preference-based measure used, target sample size, age range of the children included, sample gender composition, proxy type and sample size, mode of administration for both individuals in the dyad, statistical test(s) that report the overall and/or domain level of agreement between self- and proxy-reported HRQoL, and any reported methodological concerns. A Microsoft Excel (Version 2019; Microsoft Corporation, Redmond, WA, USA) database was used to enter and store the extracted data.

2.4.1 Extraction and Interpretation of Agreement Statistics

Inter-rater agreement is the degree to which the assessments of two or more individuals (raters) are identical using the same measure and assessing the same subject. There are multiple methods to measure inter-rater agreement based on the type of variable (continuous or categorical) and the number of raters. Agreement measures such as the intraclass correlation coefficient (ICC), Cohen’s kappa (κ), Bland–Altman plots, percentage agreement and Gwet’s agreement coefficient (AC1) assess the degree to which the assessments by the individual raters are identical or in agreement based on the type of data (e.g., nominal or continuous) [14, 17]. Correlation coefficients, also commonly reported to indicate agreement, determine the linear relationship between two continuous variables (Pearson’s product-moment correlation or Pearson’s r) or two ranked variables (Spearman’s rho) [18].

It is important to note that in statistical analysis, correlation coefficients (e.g., Pearson’s r) are considered as suboptimal measures of inter-rater agreement. They only provide a measure of the strength of a linear association between scores by raters and may indicate strong correlations even in the presence of a significant difference between the HRQoL assessments if the scores by both raters vary similarly. As a result, correlation coefficients may over- or underestimate the true level of agreement and inaccurately reflect the degree of agreement between raters [14, 18,19,20]. Inter-rater agreement is also often estimated using the percentage agreement approach [20]. However, percentage agreement does not correct for the level of agreement resulting from a random decision made by the raters. Cohen’s kappa accounts for this random agreement and is more robust [21]. Therefore, percentage agreement is excluded from this review as a measure of child and proxy agreement. Only two studies reported the inter-rater agreement using the Bland–Altman plot and were thus not included in this review.
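The distinction between correlation and agreement can be illustrated with a small numerical sketch. The data below are hypothetical: every proxy score is exactly 0.2 lower than the child score, so Pearson’s r is 1 even though the raters never agree. The ICC shown is the one-way random-effects form for two raters, chosen here for simplicity as an assumption; the included studies may have used other ICC forms.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def icc_oneway(x, y):
    """One-way random-effects ICC for two raters (assumed form)."""
    n, k = len(x), 2
    subject_means = [(a + b) / 2 for a, b in zip(x, y)]
    grand_mean = sum(subject_means) / n
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum(
        (a - m) ** 2 + (b - m) ** 2 for a, b, m in zip(x, y, subject_means)
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical utilities: proxies rate every child exactly 0.2 lower.
child = [0.9, 0.8, 0.7, 0.6, 0.5]
proxy = [0.7, 0.6, 0.5, 0.4, 0.3]

r = pearson_r(child, proxy)     # perfect linear association
icc = icc_oneway(child, proxy)  # below 0.5, i.e., poor agreement
```

In this constructed case the correlation is perfect while the ICC falls in the poor range, which is the situation the paragraph above warns about.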

Thus, in the present study, to examine the concordance in the paediatric HRQoL obtained by self and proxy reports, we treat the ICC and kappa values as primary evidence. In addition, the results of the correlation coefficients, both Pearson’s r and Spearman’s rho, are presented as supplementary evidence.

ICCs can take a value between 0 and 1, whereas kappa and correlation coefficient statistics range from − 1 to 1. Values for ICCs < 0.5 indicate poor agreement between raters, whereas values between 0.5 and 0.75, 0.75 and 0.9, and > 0.9 indicate moderate, good, and excellent agreement, respectively [22]. Spearman’s correlation coefficients with a value < 0.20 represent no correlation, values between 0.20 and 0.35 represent weak correlation, values between 0.35 and 0.50 represent moderate correlation, and values ≥ 0.50 represent strong correlation [23]. Pearson’s r coefficients are interpreted using Cohen’s conventions: the correlation is small if the coefficient is 0.30 or less, medium if it is above 0.30 but no more than 0.50, and large if it is > 0.50 [24]. Cohen’s kappa and Gwet’s AC1 have similarly defined thresholds, with classifications defined as slight (poor), fair, moderate, substantial (good) and almost perfect (very good) agreement for values ≤ 0.2, 0.4, 0.6, 0.8 and 1, respectively [17, 25].
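The interpretation thresholds above can be written as simple lookup functions. This is a minimal sketch: the function names are hypothetical, values falling exactly on a shared boundary (e.g., an ICC of 0.5 or 0.75) are assigned to the higher category as an assumption because the text does not specify this, and correlation coefficients are classified by absolute value, which is also an assumption.

```python
def interpret_icc(icc):
    """Classify an ICC using the cut-offs cited in the text [22]."""
    if icc < 0.5:
        return "poor"
    if icc < 0.75:   # assumption: 0.5 itself counts as moderate
        return "moderate"
    if icc <= 0.9:   # assumption: 0.75 itself counts as good
        return "good"
    return "excellent"


def interpret_kappa(kappa):
    """Classify Cohen's kappa or Gwet's AC1 [17, 25]."""
    for upper, label in [(0.2, "slight"), (0.4, "fair"), (0.6, "moderate"),
                         (0.8, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


def interpret_spearman(rho):
    """Classify Spearman's rho [23]; absolute value is an assumption."""
    size = abs(rho)
    if size < 0.20:
        return "none"
    if size < 0.35:
        return "weak"
    if size < 0.50:
        return "moderate"
    return "strong"


def interpret_pearson(r):
    """Classify Pearson's r by Cohen's conventions [24]."""
    size = abs(r)
    if size <= 0.30:
        return "small"
    if size <= 0.50:
        return "medium"
    return "large"
```

For example, `interpret_icc(0.44)` returns "poor" and `interpret_kappa(0.65)` returns "substantial".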

2.5 Data Synthesis and Analysis

The estimates of the agreement level between child self- and proxy-reported HRQoL were described using a textual approach in the form of a narrative synthesis [26, 27]. Several studies did not report the mean age of participating children in the dyad, and hence only the age range was analysed. Studies that included children with cancer along with other chronic illnesses were identified as non-cancer-related studies. Caregivers reporting as proxies on behalf of children were grouped under parents. When the type of correlation was not mentioned in the study, it was assumed to be Pearson’s r.

A meta-analysis was performed on a subset of the studies to synthesise the quantitative information and estimate the overall and domain-level agreement between child self- and proxy-reported HRQoL. To obtain an average estimate of inter-rater agreement, we synthesised the ICCs for overall utilities, as they are reported on a continuous scale. Similarly, considering the ordinal nature of the responses within the attributes, the kappa statistic was used to estimate the domain-level inter-rater agreement. Studies reporting only the correlation coefficients were excluded from the meta-analysis.

The meta-analysis was conducted using Stata 16.1 (Stata Corp LLC, College Station, TX, USA). Since the assumption of homogeneity is not reasonable for the present data due to the diverse nature of the target samples in consideration, we used a random-effects model to allow for between-study variability in effect sizes. The weights were estimated using a restricted maximum likelihood (REML) method [28]. A Fisher’s z-transformation was applied to obtain an approximately normal sampling distribution in order to calculate the 95% confidence intervals (CIs) for each ICC for the overall utilities. The z-scores were then transformed back into correlations for ease of interpretation [29].
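The transformation and back-transformation steps can be sketched as follows. The function name `fisher_ci` is hypothetical, and the standard error 1/sqrt(n − 3) is the common large-sample approximation for correlation-type coefficients; the text does not state which variance formula was applied to the ICCs, so this is an assumption rather than a reproduction of the Stata analysis.

```python
import math
from statistics import NormalDist


def fisher_ci(r, n, level=0.95):
    """Confidence interval for a correlation-type coefficient via Fisher's z.

    r: coefficient strictly between -1 and 1; n: number of dyads.
    Assumption: se(z) = 1 / sqrt(n - 3).
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must exceed 3")
    z = math.atanh(r)                             # Fisher's z-transformation
    se = 1.0 / math.sqrt(n - 3)
    crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    # Back-transform the limits to the original scale.
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


# Hypothetical input: coefficient of 0.5 estimated from 103 dyads.
lower, upper = fisher_ci(0.5, 103)
```

The interval is symmetric on the z scale but asymmetric around the coefficient after back-transformation.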

For the domain level meta-analysis, the standard errors (\(\mathrm{se}\)) for kappa values (\(\widehat{\kappa }\)) were calculated using the following formula (Eq. 1):

$${\mathrm{se}}_{\kappa }=\sqrt{\frac{p(1-p)}{n{(1-{p}_{\mathrm{c}})}^{2}}},$$

where \(p\) is the observed percentage agreement, \(n\) is the number of rater pairs and \({p}_{\mathrm{c}}\) is the agreement expected by chance. However, since no study reported the values for \({p}_{\mathrm{c}}\), but did report \(p\) and \(\widehat{\kappa }\), \({p}_{\mathrm{c}}\) was calculated as shown in Eq. (2) [30]:

$${p}_{\mathrm{c}}=\frac{p-\widehat{\kappa }}{1-\widehat{\kappa }}.$$
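The calculation can be sketched in a few lines. The sketch assumes that Cohen’s kappa is defined in the usual way as κ = (p − p_c)/(1 − p_c), so that the chance agreement p_c follows by rearrangement; the function names and input values are hypothetical.

```python
import math


def chance_agreement(p, kappa):
    """Recover p_c from observed agreement p and kappa.

    Assumption: kappa = (p - p_c) / (1 - p_c), rearranged for p_c.
    """
    if kappa >= 1.0:
        raise ValueError("kappa must be below 1 to recover p_c")
    return (p - kappa) / (1.0 - kappa)


def kappa_se(p, n, p_c):
    """Standard error of kappa as given in Eq. 1."""
    return math.sqrt(p * (1.0 - p) / (n * (1.0 - p_c) ** 2))


# Hypothetical study: 80% observed agreement, kappa 0.6, 100 rater pairs.
p_c = chance_agreement(0.8, 0.6)
se = kappa_se(0.8, 100, p_c)
```

With these inputs, p_c is approximately 0.5 and the standard error is approximately 0.08.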

A forest plot was used to depict the results of the meta-analysis (overall agreement). Heterogeneity was assessed using a forest plot as well as Cochran’s test of homogeneity (Q statistic) and the \(I^{2}\) statistic. Each sample was considered unique if any of the following variables relevant to the analysis were unique: type of proxy, measure, health condition, or age group composition (i.e., if children below 8 years of age were included in the sample). For longitudinal studies, a sample was also considered unique if the same sample was examined in a different time period. An exploratory meta-analysis (assuming a random-effects model) was conducted to estimate the moderation by these variables. A random-effects meta-regression was used to supplement the findings of the meta-analysis, as the studies were not considered sufficiently similar for a fixed-effects model [31]. Publication bias was evaluated using funnel plots and a regression-based funnel plot asymmetry test.
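For readers unfamiliar with the heterogeneity statistics, the following minimal sketch computes Cochran’s Q and the I-squared statistic from study-level effect sizes and their sampling variances using inverse-variance weights. The function name and the input values are hypothetical, and the sketch does not reproduce the REML random-effects estimation carried out in Stata.

```python
def heterogeneity(effects, variances):
    """Return Cochran's Q, its degrees of freedom and I-squared (%).

    effects: study-level estimates (e.g., Fisher z values);
    variances: their sampling variances.
    """
    weights = [1.0 / v for v in variances]  # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I-squared: share of total variation attributed to between-study
    # heterogeneity rather than chance, truncated at zero.
    i_squared = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, df, i_squared


# Hypothetical inputs: three samples with equal sampling variance.
q_stat, q_df, i2 = heterogeneity([0.2, 0.5, 0.8], [0.01, 0.01, 0.01])
```

A large Q relative to its degrees of freedom, or a high I-squared, is the kind of evidence that motivates a random-effects rather than a fixed-effects model.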

2.6 Risk of Bias and Quality Assessment

Two independent reviewers (DK and JK) appraised the quality and suitability of the included studies. The overall reporting quality score was calculated using a checklist for quantitative studies as given by Kmet et al., and was used to assess the risk of bias [32]. From each of the selected articles that met the inclusion criteria, we extracted information for 14 quality indicator variables (details provided in ESM Appendix 3). Two points were assigned to each of these variables if they were appropriately reported in the article, one if the item was incompletely reported, and none if not reported at all. The sum of all the points indicated the overall reporting quality score of the article, with 28 being the maximum. The summary scores were rescaled between 0 and 1, with 1 denoting the highest quality. If the item was not applicable to a particular study, scores were adjusted by excluding the total possible scores of those items from the summary score. The minimum threshold for inclusion of studies based on quality scores was set at 0.6. The results of a sensitivity analysis carried out using the criteria by Papaioannou and colleagues to confirm the conclusions from the former appraisal are reported in Appendix 4 (see ESM) [33].
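The scoring rule described above can be expressed compactly. In this minimal sketch, the function name is hypothetical and items that are not applicable are represented as `None`, which is an assumed encoding; the rescaling divides the points obtained by the maximum possible points over the applicable items only.

```python
from typing import Optional, Sequence

INCLUSION_THRESHOLD = 0.6  # minimum summary score stated in the text


def quality_score(item_scores: Sequence[Optional[int]]) -> float:
    """Rescaled reporting quality score between 0 and 1.

    item_scores: one entry per checklist item; 2 = appropriately reported,
    1 = incompletely reported, 0 = not reported, None = not applicable.
    """
    applicable = [s for s in item_scores if s is not None]
    if not applicable:
        raise ValueError("at least one applicable item is required")
    if any(s not in (0, 1, 2) for s in applicable):
        raise ValueError("item scores must be 0, 1 or 2")
    # Not-applicable items are removed from the maximum possible score.
    return sum(applicable) / (2 * len(applicable))


# Hypothetical study: 12 items fully reported, 1 partial, 1 not applicable.
score = quality_score([2] * 12 + [1, None])
meets_threshold = score >= INCLUSION_THRESHOLD
```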

3 Results

3.1 Search Results

A PRISMA flow diagram illustrates the selection process (Fig. 1). An extensive literature search of seven databases was conducted using the search strategy described above. A total of 43,522 records published between 30 June 2017 and 19 May 2021 were identified and imported into Covidence; 19,309 duplicate records were removed by Covidence, leaving 24,213 records for title and abstract screening. Of these, the vast majority (23,547) were excluded. Reasons for exclusion were (1) non-primary studies; (2) non-paediatric target population; (3) no health state utilities reported; (4) inaccessible articles; and (5) English was not the main language of publication. Subsequently, 666 records were included in the full-text review stage. At this stage, in addition to the previously specified exclusion criteria, studies were excluded if agreement statistics between the child self- and proxy-reported health state utilities and/or at domain level were not reported. In total, 30 studies (see footnote 1) fully met the inclusion criteria and were thus included in the final review.

Fig. 1 Literature search flow diagram using the PRISMA checklist. PRISMA Preferred Reporting Items for Systematic Reviews and Meta-Analyses

Footnote 1: Thirty studies were included in the final review. The two papers by Glaser et al., i.e., ‘Standardized quantitative assessment of brain tumor survivors treated within clinical trials in childhood’ [36] and ‘Applicability of the Health Utilities Index to a population of childhood survivors of central nervous system tumours in the UK’ [37], were published in two different journals but used the same sample to report different results. To prevent double counting, these two papers were considered as one.

3.2 Main Characteristics of the Studies

Table 1 presents an overview of the studies included in this systematic review. All the studies appraised for quality of reporting were of high quality, scoring 0.7 and over. The following study designs were employed: cross-sectional (83%), longitudinal (23%), and case-control (3%). HRQoL measures applied to obtain health state utilities either independently or in combination with other measures included the HUI3 (57%), EQ-5D measures (EQ-5D-Y-3L, EQ-5D-Y-5L, EQ-5D-3L, and the EQ VAS; 37%), HUI2 (33%), CHU9D (7%), and the QWB scale (3%). Cancer or history of cancer was the most common condition for which HRQoL was assessed (27%), predominantly blood and brain malignancies. Some studies (30%) also included children from the general population as the target sample or as the comparator/control group. The proxy respondent was exclusively a parent (mother, father, or a caregiver) in most of the identified studies (83%). Several studies (17%) used health professionals (nurses, physicians, and physiotherapists) or teachers as proxies, together with parents. The only exception was the study by Barr et al., which used only nurses and physicians for proxy-reported utilities using HUI2 and 3 in cancer survivors [34]. Each study administered the proxy version of the measures adopting a proxy/proxy perspective, except one [35], which used a proxy/patient perspective (asking the proxy to rate the child’s HRQoL from the child’s perspective).

Table 1 An overview of the included studies

The measures were either administered by a trained interviewer (50%) or self-completed by the children (47%). One study used both an interviewer administration mode for children below 8 years of age and self-completion for the older children [36, 37]. The majority of the studies (83%) reported the inter-rater agreement for overall utilities. Five studies only reported the domain-level agreement [35, 38,39,40,41]. When reported, ICCs were slightly more commonly represented (60%) than correlation coefficients in measuring the overall child/proxy agreement level. Cohen’s kappa (59%) was the most frequently used measure of agreement at the attribute level, followed by ICC (18%) and Gwet’s AC1 (12%).

A summary of the included studies is presented in Tables 2 and 3, grouped into cancer- and non-cancer-related conditions, respectively. All the included studies were published between 1994 and 2021 and used primary data to obtain child health state utilities by employing generic preference-based measures. The majority of the studies were from North America (USA and Canada; 33%) and Europe (UK, Spain, Netherlands, and Germany; 33%), followed by Asia (Thailand, Japan, Hong Kong, and China; 17%). Forty-five unique dyad samples based on the proxy type were included in the studies, with a total pooled sample of 3084 children and 3300 proxies. The age range for children in the included studies was between 5 and 18 years. Eight studies reported children younger than 8 years of age completing a self-report questionnaire either independently or with some assistance [35,36,37, 40, 42,43,44,45,46].

Table 2 Details of the cancer studies that reported dyad self and proxy HRQoL using preference-based quality-of-life instruments
Table 3 Details of the studies with health conditions other than cancer that reported dyad self and proxy HRQoL using preference-based quality-of-life instruments

3.3 Proxy/Child Agreement

Table 4 presents a summary of reported agreement statistics for overall utilities using ICCs or correlation coefficients, i.e., Pearson’s r and Spearman’s rho. The studies used all the identified measures except for the EQ-5D-Y-5L, and employed both caregivers and health professionals as proxies. The sample size of the dyad ranged from 11 [45] to 654 [47]. From a total of 26 studies (58 samples), 12 studies reported only the ICCs [34, 42, 43, 46,47,48,49,50,51,52,53,54], and three studies reported ICCs alongside the correlation coefficients [36, 37, 55, 56]. Six studies reported only Spearman’s rho [45, 57,58,59,60,61], whereas four studies reported only Pearson’s r [44, 62,63,64]. Details of the included studies reporting the domain-level agreement statistics are presented in Table 5. The domain-level agreement was reported for 17 studies (40 samples), of which 10 studies used Cohen’s kappa [34,35,36,37,38,39,40,41, 46, 47, 51], three studies used ICC [42, 43, 49], and two used Gwet’s AC1 [53, 54]. No study reported the domain-level agreement for the CHU9D and QWB measures.

Table 4 Details of the included studies of level of agreement by overall utilities between self- and proxy-reported HRQoL using preference-based quality-of-life instruments
Table 5 Details of the included studies’ level of agreement by domains (attributes) between self- and proxy-reported HRQoL using preference-based quality-of-life instruments

3.3.1 Inter-Rater Agreement Based on the Type of Measure

HUI2 and 3 The inter-rater agreement between children and proxies for nine studies as indicated by the ICCs was poor for overall utilities [34, 36, 37, 42, 43, 48,49,50, 55, 56]. The overall ICC for HUI2 was slightly higher than that of HUI3. In contrast to HUI2, which showed good to excellent agreement for the overall utilities for one-quarter of the samples in the studies, the agreement using HUI3 was moderate at best. The correlation coefficients obtained from 10 studies indicated moderate associations between child self and proxy reports [36, 37, 44, 45, 55,56,57,58,59,60, 63].

Across the HUI2 attributes of ‘emotion’, ‘cognition’ and ‘pain’, the overall kappa values indicated fair agreement for those domains with a moderate agreement for ‘sensation’. Overall, the kappa values suggested a substantial agreement for ‘mobility’, the highest level of agreement among all attributes, and a moderate agreement for ‘self-care’ between the child/proxy dyad [34, 36, 37, 39]. The lowest kappa values were reported for ‘emotion’ and ‘cognition’ in the assessment of HRQoL by children and proxies. For the ‘pain’ attribute, both slight and substantial levels of agreement were reported equally among the samples.

For HUI3, the overall agreement using kappa values was fair for ‘cognition’, ‘emotion’, ‘speech’ and ‘pain’; moderate for ‘hearing’, ‘dexterity’ and ‘ambulation’; and substantial for ‘vision’ [36,37,38,39, 41]. Similar to HUI2, the lowest agreement between children and proxies for HUI3 attributes was reported for ‘emotion’ and ‘cognition’. In contrast, high kappa values were frequently reported for the attributes of ‘vision’, ‘ambulation’ and ‘dexterity’, with the agreement level ranging from substantial to almost perfect.

The ICC values demonstrated poor agreement for subjective domains (‘emotion’, ‘cognition’, and ‘pain’), with some studies even reporting negative values. The agreement was moderate to good for the observable domains of ‘sensation’, ‘mobility’, ‘self-care’, ‘vision’, ‘hearing’, and ‘dexterity’, with the notable exception of ‘ambulation’ and ‘speech’, which showed poor inter-rater agreement [42, 43, 49]. The agreement within the ‘ambulation’ and ‘speech’ attributes was moderate in only one instance, between cancer survivors and their parents [43].

EQ-5D measures and the EQ VAS None of the studies reported the ICCs for the overall utilities or the summary scores using EQ-5D measures. Of the six studies reporting the ICCs for the EQ VAS scores, the majority showed poor agreement between child/proxy dyads [46, 47, 51,52,53,54]. However, an improvement in the inter-rater agreement was noted from baseline to follow-up [51, 54]. Kappa statistics reported for five studies indicated, on average, fair agreement between children and parents for all domains of EQ-5D [35, 40, 46, 47, 51]. The agreement was lowest for the ‘feeling worried, sad, or unhappy’ and ‘having pain or discomfort’ domains, followed by ‘doing usual activities’ and ‘looking after myself’, and highest for ‘walking about’.

The inter-rater agreement between children and proxies within the EQ-5D domains using Gwet’s AC1 ranged from moderate to very good [53, 54]. Children and adolescents with haematological malignancies were assessed using both 3L and 5L versions of the EQ-5D-Y in the study by Zhou et al., who found moderate to good agreement between the self- and caregiver-reported HRQoL for the five dimensions. The agreement improved from baseline to follow-up for all except the ‘having pain or discomfort’ domain in the 3L version and the ‘walking about’ and ‘looking after myself’ domains in the 5L version. However, no significant difference between the 3L and 5L versions was reported [54]. Among children with adolescent/juvenile idiopathic scoliosis (AIS/JIS), Lin et al. showed very good agreement with the caregivers in all domains except the ‘having pain or discomfort’ and ‘feeling worried, sad, or unhappy’ domains [53].

CHU9D and QWB The only study that reported the ICC using CHU9D showed moderate inter-rater agreement [50]. Using a large sample of 384 child/parent dyads, Rogers et al. reported a weak but significant correlation between the child self and proxy reports using CHU9D [64]. In their study, Czyzewski et al. reported a moderate correlation between the self- and proxy-reported utilities using QWB [62].

3.3.2 Inter-Rater Agreement Based on the Type of Proxy

Both types of proxies (parents and health professionals) showed poor inter-rater agreement, although parents showed higher agreement overall, regardless of measures and/or health conditions. All studies using health professionals as proxies assessed the HRQoL of children with cancer or child cancer survivors. Among these, Fluchel and colleagues used physicians and teachers as proxies for the children in the control group with no health condition [43]. A negative ICC (− 0.31, 95% CI − 0.22 to 0.262) was noted, indicating poor inter-rater agreement between the pair [43]. Only one study showed good to excellent agreement between cancer survivors and health professionals (nurses and physicians) using HUI2 [34]. Glaser and colleagues compared the inter-rater agreement between children with a history of cancer and their parents, physicians, and physiotherapists. Both the agreement (ICC) and correlation (Pearson’s r) values were better for parents, closely followed by physiotherapists, and worst for physicians [36, 37]. In the study by Ungar et al., the authors found a poor inter-rater agreement when children and parents reported paediatric HRQoL separately using the HUI2 and 3; however, the agreement was found to be statistically significant and moderate using a consensus-based dyad approach [49].

The agreement between children and physiotherapists was generally low with the exception of one study where physiotherapists reported higher agreement than parents and physicians within the HUI3 attributes of ‘vision’ and ‘speech’ [36, 37]. Overall, physicians reported excellent agreement when assessing the functional attributes, e.g., ‘mobility’ and ‘ambulation’, whereas the subjective attributes of ‘emotion’, ‘pain’ and ‘cognition’ lacked sufficient agreement [36, 37, 39, 42, 43].

Parents showed a similar pattern, with slight to fair agreement within the ‘emotion’ and ‘cognition’ attributes of HUI2 and 3. In the assessment of ‘emotion’, the only exception was reported in a study of children with very low birth weight by Wolke et al., which showed moderate agreement with the parents in the study population [41]. Moreover, father/child pairs agreed only slightly within all domains of EQ-5D-Y. In comparison, better agreement was reported with mothers for the domains ‘walking about’, ‘doing usual activities’ and ‘having pain or discomfort’ [46].

3.3.3 Inter-Rater Agreement Based on the Type of Condition

Within the cancer-related studies, children with a history of cancer showed a much better agreement (ICC 0.44, 95% CI 0.26–0.62) with the proxy reports than those with active cancer (ICC 0.34, 95% CI 0.04–0.64). In addition to the higher agreement level, correlations observed were also large for the former cohort (0.52, 95% CI 0.31–0.68), whereas cancer patients showed weak associations (0.40, 95% CI − 0.15 to 0.76) with the proxy reports of their HRQoL. It is unclear whether cancer-related studies showed an overall lower agreement between the child self and proxy reports of HRQoL than studies with conditions other than cancer. For instance, in a longitudinal study of cancer patients, Penn and colleagues found strong associations between the HUI3-generated overall utilities as reported by children and proxies in the study population, but weak correlations for those in the control group [59]. Conditions such as respiratory (asthma) and musculoskeletal diseases assessed using HUI2 and 3 showed poor inter-rater agreement between child self- and proxy-reported utilities [49, 55]. Using the EQ VAS, van Summeren and colleagues found good inter-rater agreement in children with functional constipation [52]. Additionally, in a longitudinal study of children with obesity, the agreement between children and parents for EQ VAS scores was found to be moderate at baseline and at follow-up [51]. Strong associations (Spearman’s rho) were noted between the utilities reported by children with cerebral palsy, hemiplegia, and/or muscular dystrophy and their parents using both EQ-5D-Y and EQ VAS [45], while the correlation between children with thalassaemia and their caregivers using the EQ VAS was weak [61]. Kulpeng et al. also indicated a large correlation (Pearson’s r) between self- and proxy-derived utilities using EQ-5D and EQ VAS in children with severe childhood infections [44].

The agreement and correlation between child self- and proxy-reported overall HRQoL observed between healthy children and proxies, including parents, physicians, and teachers, was, on average, low [43, 47]; however, evidence for the domain-level agreement was inconsistent. Kappa values in the study by Wolke et al., suggested moderate to almost perfect agreement between children with no specific health condition and parents across all HUI3 attributes [41]. In contrast, another study observed perfect agreement only within the ‘hearing’, ‘ambulation’, and ‘dexterity’ attributes, while the remaining attributes showed poor or no agreement [43]. Notably, this study used physicians/teachers as proxies rather than parents, which could potentially account for the contrasting findings. Similarly, one of the two studies using the EQ-5D-Y reported a moderate to almost perfect agreement across all domains except ‘having pain or discomfort’ and ‘feeling worried, sad or unhappy’, while the other reported lower agreement ranging from slight to fair across all domains [35, 47].

3.4 Meta-Analysis Results

The meta-analysis results below cover studies that reported the ICC (95% CI) for the overall utilities and Cohen’s kappa for the domain-level HRQoL. Nine studies were included in the analysis to estimate the ICC for overall utilities elicited using child-specific generic preference-based measures [34, 36, 37, 42, 43, 48, 49, 50, 55, 56]. Six studies that reported the ICCs for EQ VAS scores were excluded as there is some debate in the literature about VAS scores and the extent to which they can be interpreted as utilities [46, 47, 51, 52, 53, 54]. Kappa statistics for the domain-level agreement were reported for 10 studies employing HUI2 and 3 (five studies) [34, 36, 37, 38, 39, 41] and EQ-5D-Y (five studies) [35, 40, 46, 47, 51]. However, since four of five studies using EQ-5D-Y did not report the standard errors of the kappa values or the percentage agreement values, the EQ-5D-Y measure was excluded from the domain-level meta-analysis of agreement.

3.4.1 Inter-Rater Agreement for Overall Utilities

The overall ICC for all 24 samples using HUI2 and 3 with CHU9D was 0.49 (0.34–0.61) and without CHU9D was 0.48 (0.32–0.61). Figure 2 depicts the study-specific and overall estimates of ICC, their respective 95% CIs and the study weights (%). The test for homogeneity resulted in a Q test statistic of 196.18 (p < 0.001), and heterogeneity between studies was high (I² = 91%).
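The pooling step described above can be illustrated with a minimal sketch. This section does not state the exact estimator used, so the sketch assumes Fisher z-transformed ICCs, an approximate sampling variance of 1/(n − 3) per sample, the DerSimonian-Laird between-study variance, and I² computed as (Q − df)/Q. The function names, ICC values and sample sizes are hypothetical and are not data from the review.

```python
import math


def fisher_z(r):
    """Fisher z-transform of a correlation-type coefficient."""
    return 0.5 * math.log((1 + r) / (1 - r))


def pool_random_effects(iccs, ns):
    """Pool ICCs with a DerSimonian-Laird random-effects model.

    Assumption: each transformed ICC has sampling variance 1 / (n - 3).
    """
    zs = [fisher_z(r) for r in iccs]
    vs = [1.0 / (n - 3) for n in ns]

    # Fixed-effect weights are needed to compute Cochran's Q.
    w = [1.0 / v for v in vs]
    sw = sum(w)
    z_fixed = sum(wi * zi for wi, zi in zip(w, zs)) / sw
    q = sum(wi * (zi - z_fixed) ** 2 for wi, zi in zip(w, zs))
    df = len(zs) - 1

    # Between-study variance (tau^2), truncated at zero.
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)

    # I^2: share of total variability attributed to heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    # Random-effects weights and pooled estimate on the z scale.
    w_re = [1.0 / (v + tau2) for v in vs]
    z_re = sum(wi * zi for wi, zi in zip(w_re, zs)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))

    # Back-transform the estimate and its 95% CI to the ICC scale.
    return {
        "icc": math.tanh(z_re),
        "ci": (math.tanh(z_re - 1.96 * se), math.tanh(z_re + 1.96 * se)),
        "q": q,
        "tau2": tau2,
        "i2": i2,
    }


# Hypothetical samples: three ICCs with 100 child/proxy pairs each.
result = pool_random_effects([0.1, 0.5, 0.8], [100, 100, 100])
print(result)
```

With widely scattered ICCs such as these, Q is large relative to its degrees of freedom and I² approaches 100%, which is the pattern of high heterogeneity reported above.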

Fig. 2 Summary of the inter-rater reliability across studies. The forest plot depicts the study-specific and overall estimates of ICCs, their respective 95% CIs and the study weight (%) for 24 samples obtained using a random effects model. ICCs intraclass correlation coefficients, CIs confidence intervals

Exploratory moderators such as type of measure, health condition, proxy, and the age composition of the children in the sample were used to potentially explain this heterogeneity. The moderators were categorised according to the (1) type of measure used—HUI2 (12 samples) or HUI3 (11 samples) or CHU9D (1 sample); (2) health condition assessed—cancer- (15 samples) or non-cancer-related (9 samples); (3) type of proxy used—parent/caregiver (16 samples) or health professional/teacher (8 samples); and (4) lower age limit of the sample—below 8 years (10 samples) or 8 years and above (14 samples).

HUI3 had an estimated ICC of 0.37 (0.18–0.53), much lower than HUI2, which had an estimated ICC of 0.58 (0.34–0.75). The overall ICC for cancer-related samples was 0.43 (0.27–0.57), whereas for samples with conditions other than cancer, including general health, it was 0.54 (0.28–0.73). The ICC estimate for parent proxies was 0.49 (0.31–0.63), whereas for health professionals it was only marginally lower at 0.47 (0.11–0.72). Samples that also included younger children had an ICC of 0.39 (0.33–0.44), which was lower than the ICC of 0.5 (0.44–0.56) with older children. However, none of the group differences were statistically significant and therefore did not suggest moderation by any of the included variables.

The results of the meta-regression showed that none of the explanatory variables were statistically significant, thus showing no significant differences in child and proxy agreement according to the type of measure, health condition experienced, proxy type and the inclusion of children below 8 years in the sample. The funnel plot and the funnel-plot test for asymmetry (p = 0.133) did not suggest any publication bias.

3.4.2 Inter-Rater Agreement for Domain-Level Health-Related Quality of Life

The estimated kappa values and their 95% CIs for HUI2 and 3 attributes are summarised in Table 6. In total, 36 samples for HUI2 and 68 samples for HUI3 were synthesised for the meta-analysis. The estimated kappa values for HUI2 attributes of ‘emotion’ (0.25), ‘cognition’ (0.3) and ‘pain’ (0.38), and the HUI3 attributes of ‘cognition’ (0.23), ‘emotion’ (0.27), ‘speech’ (0.3) and ‘pain’ (0.36) were the lowest. In contrast, there was higher agreement for the more easily observable physical- or function-related attributes such as ‘mobility’ (0.61) for HUI2 and ‘ambulation’ (0.64), ‘dexterity’ (0.65) and ‘vision’ (0.78) for HUI3. The heterogeneity was lower for HUI2 studies (I² = 75%) than for HUI3 studies (I² = 90%). Although no small-study bias was present in the analysis of HUI3 samples (p = 0.327), there was a possibility of such a bias using the HUI2 samples (p = 0.003).

Table 6 Domain (attribute)-level overall kappa estimates with their 95% CIs for HUI2 and 3
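For readers less familiar with the domain-level statistic, the following minimal sketch shows how an unweighted Cohen’s kappa is obtained from paired child and proxy responses on a single attribute. The response levels and the function name are hypothetical; the included studies may have used weighted kappa or other variants.

```python
from collections import Counter


def cohens_kappa(child, proxy):
    """Unweighted Cohen's kappa for two raters over the same subjects."""
    n = len(child)
    # Observed proportion of exact agreement.
    p_o = sum(c == p for c, p in zip(child, proxy)) / n
    # Agreement expected by chance from each rater's marginal distribution.
    child_counts, proxy_counts = Counter(child), Counter(proxy)
    levels = set(child_counts) | set(proxy_counts)
    p_e = sum(child_counts[k] * proxy_counts[k] for k in levels) / n ** 2
    if p_e == 1.0:
        # Both raters used a single identical level; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical response levels (1-3) for eight child/proxy pairs.
child_levels = [1, 1, 2, 2, 3, 3, 1, 2]
proxy_levels = [1, 1, 2, 3, 3, 3, 1, 1]
print(round(cohens_kappa(child_levels, proxy_levels), 2))  # 0.63
```

Here the raters agree on six of eight pairs (75%), but about 33% agreement would be expected by chance, so kappa is roughly 0.63 rather than 0.75.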

4 Discussion

To our knowledge, this is the first study to comprehensively examine the evidence relating to the level of agreement between child- and proxy-reported paediatric HRQoL using generic preference-based measures across health conditions. This study systematically reviewed the papers reporting agreement measures to describe the inter-rater agreement in the assessment of paediatric HRQoL by child self and proxy reports.

Thirty studies were identified that reported the agreement statistics between child self- and proxy-reported overall and/or domain-level HRQoL. Most of these studies showed poor inter-rater agreement for overall utilities. At the domain level, there were some important differences common to all the generic preference-based measures. In particular, the agreement between children and proxy respondents was weaker for psychosocial-related HRQoL domains and stronger for physical HRQoL domains. No studies that reported agreement measures between self- and proxy-reported overall utilities over time were identified. This is an important omission as repeated HRQoL assessments over time form critical inputs for the calculation of QALYs for CUA. Divergences in self- and proxy-reported childhood utilities over time may impact, potentially substantially, upon the results of economic evaluations and regulatory decision making for the recommendation of new pharmaceuticals/medical technologies.
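To illustrate why divergent utilities over time matter for CUA, the sketch below computes QALYs as the area under the utility curve using the trapezoidal rule, a common approach that is assumed here rather than taken from the included studies. The time points and utility values are hypothetical.

```python
def qalys(times_years, utilities):
    """Area under the utility curve by the trapezoidal rule."""
    total = 0.0
    for i in range(1, len(times_years)):
        interval = times_years[i] - times_years[i - 1]
        mean_utility = (utilities[i] + utilities[i - 1]) / 2
        total += interval * mean_utility
    return total


# Hypothetical assessments at baseline, 6 months and 12 months.
times = [0.0, 0.5, 1.0]
child_utilities = [0.60, 0.70, 0.80]
proxy_utilities = [0.50, 0.55, 0.60]

child_qalys = qalys(times, child_utilities)
proxy_qalys = qalys(times, proxy_utilities)
print(round(child_qalys, 2), round(proxy_qalys, 2))  # 0.7 0.55
```

In this invented case the proxy-based estimate is 0.15 QALYs lower over one year, a difference large enough to alter an incremental cost-effectiveness ratio.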

It is unclear if the preference-based measure/s applied in the identified studies have any influence on the level of agreement between self- and proxy-reported paediatric HRQoL. In this review, we found a greater agreement with HUI2 than HUI3. There are two main differences between the measures. First, the two measures differ in their response levels. HUI3 has 5–6 response levels whereas HUI2 has 3–5 [65]. Intuitively, a higher inter-rater agreement would be expected with measures with fewer response levels if the inter-rater agreement depended on the response levels within the measure. However, a study evaluating the child and proxy agreement using the EQ-5D-Y-3L and -5L versions found a higher agreement with the five-response-level version than with three [66]. Second, HUI2 and HUI3 have different underlying constructs for the attributes with the same name. For example, in HUI2 the ‘emotion’ attribute assesses distress and anxiety, while the HUI3 frames ‘emotion’ in terms of happiness rather than depression [65]. Currently, there is insufficient evidence to investigate whether the discrepancy reflects this difference or is a coincidental finding.

The agreement for EQ VAS was lower than for the EQ-5D-Y domains. This may be attributed to the fact that the VAS and the domain-level responses are elicited using different response scales. The VAS has a response scale from 0 to 100, whereas each of the five domains is described using a 3- or 5-level response scale [3]. Hence, a higher discrepancy may be expected with VAS due to the much larger range for its response scale.

The type of proxy used was found to have some influence on the level of agreement between self- and proxy-reported paediatric HRQoL. The findings of HRQoL studies conducted in a paediatric oncology setting suggest that the information obtained from the child, the parent and the health professional is generally complementary and valid [67]. However, Sprangers and Aaronson concluded that health professionals generally tend to underestimate the pain and also, conversely, the overall HRQoL of the individual [68]. While able to accurately assess the patient’s physical condition, health professionals often failed to consider the emotional and social components of HRQoL [69]. In line with previous studies in adult cancer patients where agreement was higher with close companions, the child/parent agreement in this review was also found to be higher compared with child/health professional agreement [70]. Moreover, mothers demonstrated a higher agreement than fathers. This gender disparity may be associated with their degree of involvement in childcare [71].

The level of inter-rater agreement decreases with more severe conditions [69]. A recent study in paediatric patients found that the agreement between children and caregivers was higher when their condition improved compared with when they were ill [66]. We found that cancer-related cohorts had a lower overall agreement than cohorts with or without health conditions other than cancer. Interestingly, a low inter-rater agreement was seen between children with no obvious health conditions and their parents. One study showed worse correlations between parents and healthy children than between parents and children with a history of cancer [43]. These findings should be explored in more detail to determine whether this is a demonstrable trend. Self and proxy agreement data in the assessment of mental illnesses remain scarce. Studies have examined HRQoL in children with mental or behavioural disorders using preference-based measures, but none have assessed the level of child/proxy agreement [72, 73].

Self-report using the EQ-5D-Y has been prescribed for children aged 8 years and older [3]. The use of HUI2/3 was not recommended for self-report in children under 12 years of age [65]; however, studies have reportedly used these measures for self-completion in children younger than the recommended age group [35, 45, 48]. The minimum age at which children can reliably and accurately self-report has not been conclusively identified yet and is likely to be influenced by a variety of factors (including the reading and comprehension abilities of the child, the measure/s being applied and the mode of completion) [6]. There also remains a gap in the literature exploring the potential for differential levels of agreement between proxies and children by age groups. A previous study in a sample of children aged 8–18 years has shown that agreement decreases with age [74]. In this review, one study reported the agreement statistics (Gwet’s AC1) for children (10–12 years) and adolescents (13–15 years) separately. In both groups, the correlation between child self- and proxy-reported domain-level HRQoL was strong and positive, with a marginally stronger association reported between adolescents and caregivers than children and caregivers [53]. Due to these inconsistent findings, further research is needed to determine if an age differential exists in the level of child/proxy agreement.

We found that 33% of the studies reported only the correlation coefficients that were synthesised to describe the inter-rater agreement in this review. The difference between agreement and correlation has been addressed in the literature [19]. However, until recently, standalone correlation coefficients have been employed to assess agreement between child self and proxy report [75]. Correlation and agreement both measure the strength of association between the two variables of interest; however, the key difference is that agreement coefficients, in addition, account for the absolute agreement between the raters. Correlations may be high even if the ratings are not equal but only vary similarly. On the other hand, a perfect agreement would imply that all ratings, by each rater, are the same [14, 18]. Thus, correlation coefficients, if used, should be presented alongside agreement statistics to provide a more comprehensive picture of the level of agreement.
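The distinction can be made concrete with a small numerical sketch. It compares Pearson’s r with a two-way, absolute-agreement, single-rater ICC (often written ICC(2,1)); the choice of this ICC form is an assumption for illustration, as the included studies used various forms. The utilities are hypothetical: the proxy rates every child exactly 0.2 lower than the child does.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def icc_absolute_agreement(x, y):
    """Two-way, absolute-agreement, single-rater ICC for two raters."""
    n, k = len(x), 2
    grand = (sum(x) + sum(y)) / (n * k)
    row_means = [(a + b) / 2 for a, b in zip(x, y)]
    col_means = [sum(x) / n, sum(y) / n]
    # Sums of squares for subjects (rows), raters (columns) and error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for v in list(x) + list(y))
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Hypothetical utilities: the proxy is consistently 0.2 below the child.
child = [0.5, 0.6, 0.7, 0.8, 0.9]
proxy = [0.3, 0.4, 0.5, 0.6, 0.7]
print(round(pearson_r(child, proxy), 3))               # 1.0
print(round(icc_absolute_agreement(child, proxy), 3))  # 0.556
```

The correlation is perfect because the two sets of ratings rise and fall together, yet the ICC is only about 0.56 because the systematic 0.2 offset counts against absolute agreement.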

This study has several limitations that are important to highlight. The inter-rater agreement for overall utilities and for the respective domains was quantitatively examined for only HUI2 and 3 for the following reasons. (1) HUI measures were widely used among the studies included in this analysis, with HUI3 being the most dominant. (2) Despite its relatively wide application, the majority of the identified studies using the EQ-5D-Y did not report the overall utilities, potentially due to the absence of an established preference-based scoring algorithm for the EQ-5D-Y to date. When reported, only the correlation (using Pearson’s r or Spearman’s rho) between the child self- and proxy-reported utilities was examined. While agreement was reported for the EQ VAS scores, they were not pooled due to a paucity of evidence demonstrating the comparability of the VAS scores with the index scores. The EQ VAS scores were therefore not included in the meta-analysis. Furthermore, due to a lack of studies reporting the domain-level agreement between self and proxy reports of paediatric HRQoL, along with percentage agreement, the meta-analysis of the EQ-5D-Y domains was not feasible. (3) The analysis of the agreement level using the CHU9D and the QWB was also limited due to inadequate reporting of agreement statistics. Interpretation of the results of the meta-analysis is limited by the presence of high heterogeneity between studies, which could not be explained by the subgroup analysis. Furthermore, due to practical resource constraints, we were only able to include articles published in the English language.

5 Conclusion

This systematic review summarising the agreement between child self and proxy rating of HRQoL using established generic preference-based measures generally found a poor inter-rater agreement. Convergence with child self-rating was more likely in the proxy assessment of paediatric HRQoL within domains with observable attributes, e.g., physical health domains, than with less-observable attributes, e.g., psychosocial domains. Further research to drive the inclusion of children in self-reporting their own HRQoL wherever possible and to limit the reliance on proxy reporting of children’s HRQoL is warranted.