Introduction

Over the past decades, the results of economic evaluations have become increasingly important for informing rational decision making in health care. Cost-utility analyses (CUAs) have been recommended as the method of choice by various health authorities, such as NICE in the UK [1]. The primary outcome in a CUA is the incremental cost per quality-adjusted life year (QALY). To estimate QALYs, life years are weighted by the utility of the health state experienced, where a value of 1 characterizes full health and 0 indicates a health state as poor as being dead.

There are various methods to obtain the required utilities for CUAs. One approach that has increasingly gained popularity is the use of multi-attribute utility instruments (MAUI) [2]. Widely used MAUIs include the EQ-5D [3] and the SF-6D [4]. The main advantage of MAUIs is that once utility weights have been determined for the set of health states covered, these weights can be applied in future CUAs without any further valuations.

MAUIs also provide a link between health economics and health-related quality of life (HRQOL) research, as they are frequently applied for the assessment of HRQOL. However, the most widely used MAUIs are generic utility instruments and thus somewhat limited in their coverage of HRQOL domains that may be relevant for decision making in specific patient populations.

Recently, a cancer-specific MAUI, the EORTC QLU-C10D, was developed as a collaboration of the Multi-Attribute Utility in Cancer (MAUCa) Consortium and the European Organisation for Research and Treatment of Cancer (EORTC) Quality of Life Group [5]. As a first step, the EORTC QLU-C10D health state classification system was set up based on ten key dimensions of the EORTC QLQ-C30 [6], one of the most widely used cancer-specific quality of life (QOL) questionnaires. In a second step, country-specific utility weights (or value sets) for the QLU-C10D are being determined via valuation studies in general population samples in various countries. As health care systems and community attitudes towards health and illness vary across countries, the availability of country-specific value sets is essential.

Regarding the appropriate target group for the elicitation of health utilities, it should be noted that there is often a discrepancy between general population and patient preferences [7] and that ideally both perspectives should be taken into account [8]. In the present paper, we adopt the usual approach taken in health economics that CUAs are performed to guide societal decisions and hence should be primarily based on general population valuations [9].

In the first country-specific value set for the QLU-C10D, for Australia [10], discrete choice experiments (DCE) were used to elicit utilities, as this approach is considered advantageous in terms of comprehensibility and ease of application compared to classical approaches, such as the time trade-off method or the standard gamble [11, 12]. Subsequent QLU-C10D valuation studies for other countries are using the same methodology as the Australian study to facilitate comparability across countries.

The present paper deals with the determination of utility weights for Germany, as the first country within a series of QLU-C10D valuations in Europe. Valuations for several other European countries are underway. It should be noted that the German health authority, the Institute for Quality and Efficiency in Health Care (IQWiG), takes a critical attitude towards CUAs and recommends alternative strategies for economic evaluation of new treatments [13]. This controversy has been discussed in the literature [14, 15]. CUAs have nonetheless been performed in Germany despite the skeptical view of the IQWiG [16].

The German version of the QLQ-C30 is presently undergoing a revision of the wording of one of its response categories (level 3, “quite a bit”) as findings of a cross-national investigation demonstrated that the original German wording of this category was associated with a considerably lower severity level compared to the English label [17]. As QLU-C10D utilities are available not only for future CUAs but also for post hoc analyses of formerly conducted studies that used the QLQ-C30, determination of utility weights for both versions of the QLQ-C30 is necessary.

Thus, the principal aims of this paper are as follows:

  1. Determination of general population utility weights for the German version of the QLU-C10D by means of the established DCE approach, both for the old and the new wording of the response categories.

  2. Comparison of the utility weights of the two German versions (old and new response wording) of the QLU-C10D. We hypothesized that utility weights for the response level “quite a bit” would differ significantly between the two versions.

Methods

EORTC QLQ-C30 and QLU-C10D

The EORTC QLQ-C30 [6] is a widely used cancer-specific QOL instrument. It comprises five functioning scales, nine symptom scales or items, and a global QOL scale. All functioning items and symptoms are rated on a 4-point Likert scale ranging from “not at all” to “very much.” The original German wording for the response category “quite a bit” (“mäßig”) was criticized for addressing a lower severity level than the corresponding English expression [17]. A new German version with revised wording for this response level (“ziemlich”) has almost completed the testing phase [18] and appears to be a closer approximation of the severity level expressed by “quite a bit.” Hereafter, the two German versions will be abbreviated as QLQ-C30 German 1 (“mäßig”) and QLQ-C30 German 2 (“ziemlich”).

The QLU-C10D, the newly developed MAUI based on the QLQ-C30, consists of a health state classification system and an algorithm for the calculation of utilities for the health states defined by the system [5, 10]. The health states cover 10 key dimensions of the QLQ-C30: physical functioning, role functioning, social functioning, emotional functioning, pain, fatigue, sleep disturbances, appetite loss, nausea, and bowel problems. The severity of impairment in each dimension is expressed by the four categories used in the QLQ-C30: not at all, a little, quite a bit, and very much. For use in the DCE, each health state is combined with a survival time, which can take one of four values: 1, 2, 5, or 10 years.

DCE valuation task

Valuation of the German QLU-C10D was performed using the same methodology as employed in the Australian valuations. In particular, the same DCE design was used for elicitation of utilities. This was based on a total of 960 choice sets which were determined by methods of optimal design theory to maximize efficiency in estimating model parameters [19]. Each respondent had to complete 16 choice sets which were randomly selected from the 960. In each choice set, the respondent had to select one of two scenarios, A or B, each consisting of a health state defined by the ten attributes of the QLU-C10D and a survival time (Table 1). To keep the burden for the respondent at a manageable level, only five attributes in a choice set differed between scenarios (highlighted in yellow), whereas the severity level of the remaining attributes was kept equal. The order of the ten attributes was randomized for each respondent (and kept constant for each respondent across the 16 choice sets). The survival time was always presented last. For details regarding the DCE refer to the original paper [10]. An example of a choice set is shown in the Appendix (Fig. 5).

Table 1 Health state classification system of the QLU-C10D

Valuation survey

Separate valuation surveys for the two German versions of the QLU-C10D were run (denoted as Germany 1, Germany 2). Both surveys were administered online and consisted of the following parts: introduction, informed consent, information on age and sex for quota sampling (for Germany 2), the EORTC QLQ-C30, the DCE valuation task, four feedback questions on the DCE, further socio-demographic and basic clinical questions, EQ-5D-5L [20], and the Kessler K-10 mental health questionnaire [21]. Recruitment and data assessment were contracted to a company specialized in the conduct of DCEs which was successfully engaged for the Australian valuations.

The translation of the attributes for the DCE was taken from the validated German version of the QLQ-C30. Validated German versions of EQ-5D-5L and Kessler K-10 were available. The remaining text of the survey was translated from Australian English into German by native speakers of the target language who were fluent in English. The translation procedure included forward and backward translations as well as feedback from in-country persons.

Sample

Online samples of approximately 1000 adults from the German general population were recruited for each of the Germany 1 and Germany 2 surveys. The survey was sent out as a weblink for the respondents to complete at their leisure. Potential respondents were members of an online panel of persons willing to complete surveys for a small payment. Participants were eligible if aged 18 to 80 years. Representativeness of the sample was investigated by comparison with national census data [22]. As the age and sex distribution in the Germany 1 sample differed considerably from that in the general population, quota sampling by age and sex was used in the Germany 2 valuation to achieve representativeness for these variables.

Additional sources had to be accessed to obtain estimates of the population prevalence of chronic diseases in persons aged 18–80, as no differentiation within the group of persons aged 65+ was made in the official statistics. Estimates were derived indirectly from articles on the prevalence of chronic diseases in the elderly in Germany [23] and Sweden [24].

Sample size considerations

Determination of sample size was based on the width of the confidence intervals (CI) for the utility decrements found in the Australian valuation study with a total of 1833 respondents. Using the notation [u − d, u + d] for 95% CIs of utility decrements (where u is the estimated utility decrement and d is half the width of the CI, i.e., the margin of error), all values of d in the Australian study were ≤ 0.031, and for all domains but one d was below 0.0245. For the German valuation study, we wanted to ensure that utility weights can be estimated with d ≤ 0.05 throughout. When using a sample size of 1000 respondents and allowing for the possibility of a slightly larger spread due to a more heterogeneous response pattern (factor 1.2), the corresponding values are d ≤ 0.05 for all domains and d ≤ 0.0396 for all but one domain, i.e., the above condition regarding the error margin is satisfied.
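The rescaling of the error margin can be sketched as follows. The text does not state the exact formula used, so this sketch assumes that the margin of error d is proportional to 1/√n and is then multiplied by the spread factor of 1.2; the function name `scaled_margin` is hypothetical. Under this assumption the margins come out at roughly 0.050 and 0.040, close to (but not identical with) the bounds reported above, which may reflect rounding or a slightly different scaling in the original calculation.

```python
import math


def scaled_margin(d_ref: float, n_ref: int, n_new: int, spread_factor: float = 1.0) -> float:
    """Rescale a CI half-width from a reference study to a new sample size.

    Assumes the margin of error is proportional to 1/sqrt(n). This is an
    illustrative assumption, not a formula given in the text.
    """
    return d_ref * math.sqrt(n_ref / n_new) * spread_factor


N_AUS = 1833   # respondents in the Australian valuation study
N_GER = 1000   # planned respondents per German valuation survey
SPREAD = 1.2   # allowance for a more heterogeneous response pattern

d_max = scaled_margin(0.031, N_AUS, N_GER, SPREAD)   # largest margin across domains
d_typ = scaled_margin(0.0245, N_AUS, N_GER, SPREAD)  # bound for all domains but one

print(f"largest margin: {d_max:.3f}, all but one domain: {d_typ:.3f}")
```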

Statistical analysis

Statistical analyses were run using SPSS, version 24, and Stata, version 13. The latter software was used for advanced DCE analyses, in particular for the mixed logit model.

Testing representativeness and analyzing feedback questions

For comparison of the valuation sample with national statistics data with regard to socio-demographic and clinical characteristics, the Chi-square test was used. Feedback questions were analyzed by descriptive statistical methods.

DCE analyses

The main part of the analysis consisted of the determination of utility weights for the QLU-C10D using the approach proposed in [11]. The basic model for the utility of option j (scenario A or B) in choice set s for respondent i is given by

$$U_{isj} = \alpha {\text{TIME}}_{isj} + \beta X^{\prime}_{isj} {\text{TIME}}_{isj} + \varepsilon_{isj} ,$$

where $\text{TIME}_{isj}$ is the survival time presented in option j and $X^{\prime}_{isj}$ is a set of dummy variables indicating the levels of the corresponding health state. The errors $\varepsilon_{isj}$ were assumed to be independent and identically Gumbel distributed. The parameters α (scalar) and β (vector) were estimated by conditional logistic regression. Regression weights were then converted into utility decrements, defined as the ratio of each health state parameter in β to the time coefficient α, to reflect the trade-off between health-related QOL and length of life [11]. This method was also used to analyze the Australian QLU-C10D valuation data [10]. To allow for correlated observations within respondents, a random subjects term was included in the model using generalized estimating equation (GEE) models with a first-order autoregressive covariance structure. We also compared the utility weights of the complete sample with the utility weights obtained when omitting those respondents who perceived the DCE as difficult or very difficult (and who may therefore have given less reliable replies). For this purpose, the mean absolute difference (MAD), the Pearson correlation coefficient, and the intra-class correlation coefficient (ICC) between the two sets of utility decrements were determined, both for Germany 1 and 2.
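To illustrate how utility decrements follow from the model above, the following minimal sketch simulates paired choices for a single hypothetical dimension and recovers the ratio β/α. It is not the analysis code of the study: the parameter values are invented, only one dimension is modelled, and the within-respondent correlation handled by the GEE approach is ignored. With two options per choice set, the conditional logit likelihood reduces to a logistic regression without intercept on the differences of the regressors, which is fitted here by Newton-Raphson.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" parameters, chosen only for this illustration.
alpha_true = 0.30                             # coefficient of survival time
beta_true = np.array([-0.03, -0.06, -0.10])   # dummies for levels 2-4 of one dimension

n_sets = 200_000
times = np.array([1.0, 2.0, 5.0, 10.0])       # survival times used in the DCE


def draw_option(n):
    """Draw survival times and severity dummies (level 1 is the reference)."""
    time = rng.choice(times, size=n)
    level = rng.integers(1, 5, size=n)        # severity level 1..4
    dummies = np.stack([level == k for k in (2, 3, 4)], axis=1).astype(float)
    # Regressors of the model: TIME and X * TIME
    return np.column_stack([time, dummies * time[:, None]])


feat_a, feat_b = draw_option(n_sets), draw_option(n_sets)
theta_true = np.concatenate([[alpha_true], beta_true])

# Gumbel errors: their difference is logistic, which gives the logit choice rule.
u_a = feat_a @ theta_true + rng.gumbel(size=n_sets)
u_b = feat_b @ theta_true + rng.gumbel(size=n_sets)
chose_a = (u_a > u_b).astype(float)

# Logistic regression without intercept on the differences A - B.
z = feat_a - feat_b
theta = np.zeros(z.shape[1])
for _ in range(25):                           # Newton-Raphson iterations
    p = 1.0 / (1.0 + np.exp(-(z @ theta)))
    grad = z.T @ (chose_a - p)
    hess = (z * (p * (1.0 - p))[:, None]).T @ z
    theta = theta + np.linalg.solve(hess, grad)

alpha_hat, beta_hat = theta[0], theta[1:]
decrements = beta_hat / alpha_hat             # utility decrements for levels 2-4
print(np.round(decrements, 3))
```

Dividing by the time coefficient puts the health state parameters on a scale where one year in full health has a value of 1, so the estimated ratios should lie close to the ratios of the simulated parameters.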

Comparison of utility decrements of the two German QLU-C10D versions

For this purpose, the DCE data of the two German samples were pooled. Comparison of the two versions regarding the effect of response level on utility decrements was performed by a series of likelihood ratio (LR) tests.

Modifications of DCE analysis and additional analyses

  1. Some of the utility decrements obtained in the analysis did not show a monotonic pattern, i.e., increasing severity coinciding with increasing decrements. When this occurred, the non-monotonic levels were combined. This restriction has routinely been imposed in previous studies [4, 10, 25].

  2. As the distribution of age and gender in the Germany 1 valuation sample differed significantly from that of the German population, a weighted analysis was performed in addition to the unweighted one. For this, the sample was stratified by age group and sex, imposing weights to achieve representativeness for each combination of the two variables [19].

  3. Additional analyses of the DCE data were performed by means of mixed logit models, for both the Germany 1 and 2 datasets. In this model, it was assumed that the coefficients α and β were drawn from a distribution, thus allowing for heterogeneous preference patterns between respondents. More details may be found in the paper on the Australian valuations [10]. As the mixed logit model deals with the distribution of parameters rather than with point estimates, its use for estimating utility decrements (which are derived by taking ratios of β and α) entails considerable statistical problems. As noted by Gu et al., the distribution of the ratio can have an extremely wide spread when the denominator is close to zero (and the mean can be extremely high) [26]. Hence, we used the conditional logit model for estimating utility decrements, which is in line with usual practice, but present the mixed logit results for those interested in which dimensions demonstrate considerable preference heterogeneity.
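The weighting used in the weighted analysis for Germany 1 can be illustrated with a simple post-stratification sketch. The exact weighting procedure is not spelled out in the text, so the rule below (weight = population share of the age-by-sex cell divided by its sample share) is an assumption, and both the toy sample and the population shares are invented for illustration.

```python
from collections import Counter


def poststratification_weights(sample_cells, population_shares):
    """Return one weight per respondent: population share / sample share of the cell."""
    counts = Counter(sample_cells)
    n = len(sample_cells)
    cell_weight = {
        cell: population_shares[cell] / (counts[cell] / n) for cell in counts
    }
    return [cell_weight[cell] for cell in sample_cells]


# Hypothetical toy data: (age group, sex) per respondent; not the study data.
sample = (
    [("18-39", "m")] * 40
    + [("18-39", "f")] * 20
    + [("40-80", "m")] * 25
    + [("40-80", "f")] * 15
)
population = {
    ("18-39", "m"): 0.20,
    ("18-39", "f"): 0.20,
    ("40-80", "m"): 0.30,
    ("40-80", "f"): 0.30,
}

weights = poststratification_weights(sample, population)

# Weighted share of one cell; it matches the population share by construction.
share_older_women = sum(
    w for w, cell in zip(weights, sample) if cell == ("40-80", "f")
) / sum(weights)
print(round(share_older_women, 2))
```

Respondents from under-represented cells receive weights above 1 and those from over-represented cells receive weights below 1, so that each age-by-sex combination contributes in proportion to its population share.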

Results

Complete cases and dropouts

An overview of the respondent flow is given in the Appendix (Table 5). A proportion of those invited to the survey dropped out immediately upon reading the description of the survey (14.3% and 10.0% for Germany 1 and 2, respectively); presumably it did not interest them. Further, for the Germany 2 survey, 34.4% were excluded because the quota for their age and sex group had already been filled; there were no such exclusions for Germany 1, as there was no quota sampling. Of the remainder, completion rates were close to 90%. Specifically, 1002 of 1135 (88.3%) respondents entering the Germany 1 valuation component of the survey and 1016 of 1124 (90.4%) respondents meeting the quota-sampling criteria for the Germany 2 valuation completed all the survey components. These ‘complete case’ respondents form the analysis dataset for all results reported below.

Socio-demographic and clinical data: comparison with national statistics

An overview of socio-demographic and clinical characteristics of the two valuation samples is given in Table 2. The Germany 1 sample showed significant departures from the general population in the distribution of respondents’ age, gender, and education. In particular, the sample included a smaller fraction of persons in the oldest age group, a smaller proportion of women, and a larger percentage of highly educated people compared to the German general population. As some of these differences were quite substantial (≥ 10%), we performed a weighted analysis of the DCEs in addition to the standard analysis. Quota sampling achieved population-representative distributions of age and sex for the Germany 2 sample, and it exhibited significant deviations from the general population only for education, again with a larger proportion of more highly educated persons.

Table 2 Distribution of socio-demographic and clinical characteristics—Germany 1 and Germany 2

Feedback questions on the DCE

More than two-thirds of the respondents (Germany 1: 69.1%, Germany 2: 67.6%) regarded the presentation of the DCE as clear or very clear, and only a minority (12.1% vs. 11.5%) as unclear. However, almost half of the respondents (47.6% vs. 46.6%) considered the DCE task (choosing between situation A and B) difficult or very difficult, whereas only about a quarter (23.1% vs. 26.1%) found it easy or very easy. Even so, DCE results remained fairly stable when those respondents who perceived the task as difficult or very difficult were excluded from the analysis: the MAD between the two sets of utility decrements (all respondents vs. those respondents who did not find the task difficult) was 0.0131 and 0.0178 for Germany 1 and 2, respectively. Pearson correlations between the two utility sets were r = 0.962 and r = 0.951, and ICCs were 0.957 and 0.940 for the two German versions. Regarding the response strategy used, almost half of the respondents replied that they concentrated on a few aspects or on those highlighted in yellow (45.6%, Germany 1 and 2 pooled), whereas 40.9% stated that they considered most or all aspects. Only a small fraction used other strategies (5.8%).

Raw utility decrements (without correction for non-monotonicity)

Findings of the DCE analysis are shown in Figs. 1 and 2 and in Tables 3 and 4 (for Germany 1 and Germany 2, respectively). Numbers displayed are utility decrements for each dimension and severity level (a little, quite a bit, and very much). Note that utility decrements for the level “not at all” are 0 by definition. Utilities of the individual health states can be obtained by subtracting the respective utility decrements or a linear combination of them from 1.

Fig. 1 Utility decrements for the German version 1 of the QLU-C10D (raw decrements without adjustment for monotonicity)

Fig. 2 Utility decrements for the German version 2 of the QLU-C10D (raw decrements without adjustment for monotonicity)

Table 3 QLU-C10D utility decrements for German version 1 (Response level 3 = “mäßig”)
Table 4 QLU-C10D utility decrements for German version 2 (Response level 3 = “ziemlich”)

The largest utility decrements were observed for the domain of physical functioning (PF), with decrements of −0.083, −0.162, and −0.278 for the three levels of restrictions in PF for Germany 1 and similarly sized decrements for Germany 2. For both German versions, the second largest utility decrements were seen for pain, followed by role functioning and social functioning. The decrements for the other domains were considerably lower, with nausea and bowel problems following in fifth and sixth place. Utility decrements were smallest for emotional functioning, fatigue, sleep disturbances, and appetite loss, attaining statistical significance only for a few severity levels.

Constrained utility decrements

In a total of six instances, the utility decrements for Germany 1 were not monotonically ordered (see Fig. 1). Most deviations from monotonicity were small and non-significant; only one reached statistical significance (for sleep disturbances, χ² = 5.76, p = 0.016). In all these cases, utility decrements for response levels 2 (a little) and 3 (quite a bit) were reversed. To obtain a set of utility weights fulfilling the monotonicity condition, the analysis was rerun constraining levels 2 and 3 to a single utility decrement where required (see Fig. 3 and right-hand column of Table 3). For Germany 2, violations of monotonicity were observed in three cases, none of which reached statistical significance. In particular, no reversal of levels 2 and 3 occurred. Monotonicity-constrained utility decrements for Germany 2 are shown in Fig. 4 and in the right-hand column of Table 4.
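A check for monotonicity violations of the kind described above can be sketched as follows. The function only flags adjacent severity levels whose decrements are reversed; it does not perform the constrained re-estimation, which requires refitting the model with the affected levels merged. The function name and the reversed example are hypothetical, whereas the physical functioning values are the raw Germany 1 decrements quoted in the text.

```python
def monotonicity_violations(decrements_by_level):
    """Flag adjacent severity levels whose utility decrements are reversed.

    decrements_by_level maps severity level (2, 3, 4) to a decrement (<= 0).
    Level 1 ("not at all") has a decrement of 0 by definition.
    """
    ordered = sorted(decrements_by_level)
    levels = [1] + ordered
    values = [0.0] + [decrements_by_level[k] for k in ordered]
    return [
        (lower, higher)
        for lower, higher, v_lower, v_higher in zip(levels, levels[1:], values, values[1:])
        if v_higher > v_lower  # higher severity but smaller loss of utility
    ]


pf_germany1 = {2: -0.083, 3: -0.162, 4: -0.278}   # raw values quoted in the text
toy_reversed = {2: -0.030, 3: -0.020, 4: -0.060}  # hypothetical, for illustration only

print(monotonicity_violations(pf_germany1))
print(monotonicity_violations(toy_reversed))
```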

Fig. 3 Utility decrements for the German version 1 of the QLU-C10D (with adjustment for monotonicity)

Fig. 4 Utility decrements for the German version 2 of the QLU-C10D (with adjustment for monotonicity of levels)

Comparison of Germany 1 and 2

Likelihood ratio testing revealed a significant overall difference between the utility decrements of the two German versions (χ² = 68.8, df = 30, p < 0.001). This was mainly due to an effect of response level 3, i.e., the level for which the wording of the two German versions differed (χ² = 63.1, df = 10, p < 0.001). The corresponding coefficients all had negative signs, indicating that the estimated utility decrements were larger for Germany 2 than for Germany 1 across domains, in accordance with our hypothesis. In addition, there was a significant effect of response level 4, “very much” (χ² = 26.2, df = 10, p = 0.003). All but one of the beta coefficients had a positive sign, showing that the estimated utility decrements for response level 4 were generally smaller for Germany 2 than for Germany 1. No significant interaction between level 2 and version was found. More details can be found in the Appendix (Table 6).
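The p-values of the likelihood ratio tests can be checked from the reported test statistics with a few lines of code. The sketch below uses the closed-form upper-tail probability of the chi-square distribution, which holds for even degrees of freedom (as is the case for df = 10 and df = 30); the function name is hypothetical, and a statistics library would normally be used instead.

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable for even df (closed form)."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i          # (x/2)^i / i!
        total += term
    return math.exp(-half) * total


# Test statistics and degrees of freedom as reported in the text.
reported = {"overall": (68.8, 30), "level 3": (63.1, 10), "level 4": (26.2, 10)}
p_values = {name: chi2_sf_even_df(x, df) for name, (x, df) in reported.items()}

for name, p in p_values.items():
    print(f"{name}: p = {p:.2g}")
```

For response level 4 this gives a p-value of about 0.0035, consistent with the reported p = 0.003 after rounding; the other two p-values are well below 0.001.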

Calculation of QLU-C10D utilities

The basis for the calculation of QLU-C10D utilities for both Germany 1 and Germany 2 is the set of monotonically ordered utility weights displayed on the right-hand side of Tables 3 and 4, respectively. Utilities are obtained by subtracting the respective utility weights from 1. For instance, the utility for the health state (2331111111), i.e., a little restriction in PF, quite a bit of restriction in RF and SF, and optimal health in all other dimensions, amounts to

$$1 - 0.084 - 0.078 - 0.039 = 0.799,$$

in the metric of Germany 1 and to

$$1 - 0.062 - 0.085 - 0.059 = 0.794$$

in the metric of Germany 2.
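The calculation can be expressed as a short lookup-and-subtract routine. The sketch below contains only the three decrements per version that are quoted in the example above; the full value sets are given in Tables 3 and 4 and are not reproduced here. The dimension order follows the listing in the Methods section, and the abbreviations other than PF, RF, and SF are labels chosen for this illustration.

```python
# Monotonicity-constrained decrements quoted in the worked example above.
# The complete value sets (Tables 3 and 4) are not reproduced here.
DECREMENTS = {
    "Germany 1": {("PF", 2): 0.084, ("RF", 3): 0.078, ("SF", 3): 0.039},
    "Germany 2": {("PF", 2): 0.062, ("RF", 3): 0.085, ("SF", 3): 0.059},
}

# Order of the ten dimensions; labels after SF are hypothetical abbreviations.
DIMENSIONS = ["PF", "RF", "SF", "EF", "PA", "FA", "SL", "AP", "NA", "BO"]


def utility(state: str, version: str) -> float:
    """Utility = 1 minus the sum of decrements of all dimensions above level 1."""
    if len(state) != len(DIMENSIONS):
        raise ValueError("a health state needs one level per dimension")
    total = 1.0
    for dim, level_char in zip(DIMENSIONS, state):
        level = int(level_char)
        if level > 1:
            # Raises KeyError for decrements that are not quoted in the text.
            total -= DECREMENTS[version][(dim, level)]
    return total


for version in DECREMENTS:
    print(version, round(utility("2331111111", version), 3))
```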

The utility of the worst possible health state (4444444444) takes the value −0.136 for Germany 1 and the value −0.001 for Germany 2.

Additional analyses

Weighted analysis for Germany 1

A weighted analysis of the DCE data for Germany 1, adjusting for non-representativeness with regard to age and sex, yielded results very similar to those of the unweighted analysis. With one exception, the utility decrements of the weighted analysis differed from the original ones by less than 0.01 in either direction. A slightly larger difference was observed for level 3 of the SF domain (estimated decrements of −0.037 and −0.025 for the unweighted and weighted analyses, respectively, i.e., a difference of 0.012 points).

Mixed logit analyses

Findings of the mixed logit analyses for Germany 1 and 2 are shown in the Appendix (Tables 7 and 8). Regarding non-monotonicity, essentially the same patterns were found as in the conditional logistic regression analysis above. The majority of the estimated standard deviations of the model parameters were significantly greater than 0 both for Germany 1 (29 of 31 parameters) and Germany 2 (27 of 31), reflecting considerable heterogeneity in individual respondents’ preferences.

Discussion

The EORTC QLU-C10D is the first cancer-specific utility instrument for which valuations are being performed in multiple countries internationally using a standard valuation method. Germany is the second country after Australia [10] for which QLU-C10D utility weights become available. As the German version of the underlying parent QOL instrument is presently undergoing a revision of the wording of one response category, utility weights were determined for both versions (Germany 1 and 2).

Online samples were used for both valuations. Quota sampling by age and sex achieved representativeness for these two key demographic variables in the Germany 2 valuation survey. Representativeness of this sample was generally good for other characteristics, except for a surplus of respondents with high educational levels. Lack of quota sampling in the Germany 1 valuation resulted in significant non-representativeness for both age and sex. However, the effect of this imbalance on the derived utility decrements was small, as shown by the weighted analysis.

Generally, our findings on German QLU-C10D utility weights look plausible and agree with our expectations. In particular, DCE analyses resulted in a meaningful order of dimensions by size of utility weights. Physical functioning received the largest utility weights, followed by pain, role functioning, and social functioning. The cancer-specific dimensions of nausea and bowel functioning came in fifth and sixth place. This agrees well with the pattern found in the Australian QLU-C10D valuation where only one dimension, emotional functioning, received considerably larger utility weights than in our valuation [10].

Similar to the Australian QLU-C10D valuation, three dimensions particularly relevant to cancer patients were given fairly small utility weights: fatigue, appetite loss, and sleep disturbances. It may be that the relatively low utility decrements for these dimensions reflect a lack of experience of these symptoms in the general population. This calls for valuation studies to be performed in patient populations in order to scrutinize the above assumption. As a first step towards this aim, we have started with patient valuations for the QLU-C10D in Austria after completing valuations in the Austrian general population (not yet published).

Small utility weights were also observed for the dimension of emotional functioning. This may have to do with the German wording for the key item used to describe this dimension, “depressed.” It was translated as “niedergeschlagen” (similar to downcast or moody) in the QLQ-C30 which is probably perceived as weaker than the English word “depressed” by respondents.

Regarding monotonicity of utility weights, there were remarkable differences between the two German versions. While only a few small deviations from monotonicity were seen in the Germany 2 valuation, considerable problems with monotonicity occurred for Germany 1. All of these involved a reversal of the levels “a little” and “quite a bit” indicating that respondents had difficulties distinguishing these two response levels in the Germany 1 valuation task. Change of the German wording of the category “quite a bit” to a stronger expression obviously solved the problem as no reversal of these two categories occurred in the Germany 2 valuation.

A number of CUAs in oncology have been conducted in Germany or in multi-center studies including Germany in recent years. Some of these used generic utility instruments [27, 28], others obtained health utility values from the literature [29, 30], by expert ratings [31] or via mapping procedures [32]. The new utility instrument offers a valuable alternative as patient utilities can directly be obtained from the parent instrument, the QLQ-C30, which is routinely used in many oncological studies conducted in European countries, including Germany. Moreover, as a cancer-specific utility instrument it has the potential to capture cancer-specific treatment effects better than generic MAUIs like the EQ-5D, although this is yet to be tested empirically.

Our study has some limitations. First, our samples may not be representative of the entire population because respondents self-selected into the online panel. Quota sampling for age and sex ensured that the Germany 2 sample was representative with regard to these two key demographics. However, there was an over-representation of respondents with high educational levels in both valuation surveys. This may be a typical characteristic of online samples, as it was also found in the Australian survey [10]. It is important to note that the effect of education on health utility values has been found to be consistently small in a systematic review of EQ-5D valuation studies [33] and in other valuations [34, 35]. Second, some non-monotonicities were encountered in the Germany 1 survey. Consistent with the practice used by other researchers, we imposed constraints on model parameters to remove non-monotonicities [10].

In summary, the present paper provides utility weights for the new cancer-specific utility instrument, the QLU-C10D, for an economically important European country, Germany. This is of relevance for future cancer-related CUAs performed in Germany. At the same time, the present paper is to be regarded as one piece of research within a larger program with a broader, international perspective. Thus, QLU-C10D valuations for a number of other countries, including the UK, France, Italy, Poland, Canada, and the US, are presently underway. In their entirety, they should provide a basis for more targeted decision making in cancer care.