FormalPara Key Points For Decision Makers
Table 1

1 Introduction

Health technology assessment is developing rapidly in Central and Eastern Europe (CEE): e.g. in Bulgaria, Czechia, Hungary and Croatia [1,2,3,4]. In Poland, it is compulsory when applying for drug reimbursement [5]. The Polish Health Technology Assessment Agency (AOTMiT) has issued approximately 1800 recommendations since its foundation in 2006, and has assessed nearly 500 health technology assessment reports since the introduction of the current Reimbursement Act in 2012 [6]. Based on this regulation, cost utility is the preferred form of pharmacoeconomic analysis with the official threshold for the cost per quality-adjusted life-year updated yearly [7]. AOTMiT recommends EQ-5D for the purposes of valuing health states and calculating quality-adjusted life-years [8].

The EQ-5D questionnaire consists of a descriptive system and a visual analog scale [9]. The descriptive system contains five dimensions: mobility (MO), self-care (SC), usual activities (UA), pain/discomfort (PD) and anxiety/depression (AD). In the original version (EQ-5D-3L), each dimension has three levels: no, some or severe problems; whereas there are five levels in the new version (EQ-5D-5L): no, slight, moderate, severe or extreme problems [10, 11]. The previous two mentioned questionnaires define 243 and 3125 health states, respectively. By attaching disutility to each of the levels in each dimension, it is possible to calculate a single value (EQ-Index) for every health state, forming a value set [12]. EQ-5D-5L demonstrates better measurement properties than EQ-5D-3L [13, 14].

There are only two published EQ-5D-3L value sets in CEE countries: Slovenian [15] and Polish [16], and no value set for EQ-5D-5L. Although it has been possible to use EQ-5D-5L in Poland [17,18,19,20], using the mapping-based cross-walk value set [21, 22], the lack of a directly measured EQ-5D-5L value set limited the implementation within the decision-making process.

Our objective was to derive a Polish tariff for the EQ-5D-5L descriptive system, using a standardised approach developed by the EuroQol Group. Such a value set could also be used by other CEE countries that are too small to finance valuation studies, yet are culturally similar and are likely to have congruent health preferences.

2 Methods

The methods and analyses reported in this paper comply with the CREATE guidelines for reporting valuation studies of multi-attribute utility-based instruments [23].

2.1 Study Design

Quota-based sampling was applied using Polish census data from November 2014, based on personal identification number registry (PESEL) and Central Statistical Office data on education [24]. A representative sample in terms of age, sex, education, geographical region and the size of the locality was obtained from Polish residents aged 18+ years. Individuals were recruited through a mixed strategy (public locations, personal contact). Interviews were conducted in public venues or at participants’ homes. Respondents received a financial incentive (voucher of value equivalent to €8).

The study design followed a valuation protocol: EuroQol Valuation Technology (EQ-VT 2.0). It includes software for conducting computer-assisted personal interviews, an interviewer script, standardised training materials, data quality-control procedures and an Excel-based quality-control tool enabling monitoring of protocol compliance, interviewer effects and the validity of the collected data [25].

2.2 Valuation Interview

Computer-assisted personal interviews consisted of four main parts: introduction, composite time trade-off (TTO) valuation, discrete choice experiment (DCE) valuation and country-specific background questions. After a general introduction and explanation of the purpose of the study, the respondents self-reported their health using the EQ-5D-5L questionnaire and answered basic background questions (about age, sex and experiences of severe illness).

In the composite time trade-off valuation, a composite approach was used: starting with the standard TTO (to find the number of years in full health equivalent to 10 years in an impaired EQ-5D-5L state) and shifting to a ‘lead time’ TTO when participants considered the state to be worse than dead (see detailed descriptions [26,27,28,29]). The resulting TTO values range from − 1 to 1 in 0.05 increments (the smallest tradable unit being 6 months in duration).

The TTO part of the interview consisted of an explanation of the TTO procedure (the ‘being in a wheelchair’ example and three practice states: mild, severe and difficult to imagine), proper TTO valuation of ten EQ-5D-5L health states, a structured TTO debriefing and the TTO feedback module. Each respondent was presented with the rank ordering of health states derived from previous responses, to indicate states for which they were not happy with the ranking (though there was no possibility of re-evaluation).

The TTO experimental design included 86 EQ-5D-5L health states distributed into ten blocks that were balanced in terms of severity of states. The health states used in EQ-VT 2.0 were selected using a Monte Carlo simulation [30]. Each block included one of five very mild states (only one dimension at level 2 and all others at 1), the most severe state (‘55555’) and eight intermediate states. Respondents were randomised into one of the ten blocks; the health states were presented in a random order.

In the DCE valuation task, participants were presented with a pair of EQ-5D-5L health states with no duration specified (labelled A and B) and asked to indicate which they consider ‘better’ [30,31,32]. This part of the interview consisted of instructions regarding the task, the valuation of seven pairs and a structured debriefing.

The DCE experimental design included 196 pairs of states randomly divided into 28 blocks, which were identified using an efficient Bayesian design. The blocks were similar in terms of severity, assessed by the sum of the level scores of the health states (i.e. the misery index). Participants were randomly assigned to one of the blocks. The question order and left-right positioning of states were randomised.

The set of Polish country-specific questions covered: priorities in TTO valuations (length or quality of life), general health using an SF-1 question from the SF-36 questionnaire [33], comorbidities, potential concerns during severe illness, religiosity and beliefs, relationship status, childcare responsibilities, professional status and financial situation. In accordance with the EQ-VT protocol, the minimal recommended sample size for EQ-5D-5L valuation studies is N = 1000 (see the detailed description [30]). Given a planned experimental arm of our research, we established the basic target sample size at N = 1250 (the methods and results of the experimental substudy will be reported elsewhere).

2.3 Quality Control and Data Analysis

We excluded (1) interviews of suspicious quality (‘flagged’ interviews; for a detailed description of quality-control procedures see Electronic Supplementary Material [ESM] 1), (2) the first ten interviews conducted by an interviewer not meeting the minimum quality criteria (at least seven unflagged interviews) and (3) individual TTO valuations when marked by the respondent in the Feedback Module as not adequately representing their health preferences. No individual DCE valuations were excluded. Descriptive statistics were used to summarise the respondent’s characteristics and responses to the TTO and DCE tasks.

2.4 Modelling

2.4.1 General Approach

Below, we present the general approach (dependent/independent variables, model-selection criteria, estimation technique and the building blocks of the model specification under consideration). The formal specification is presented in ESM 1, Online Resource 2.

We based the final model on data from both elicitation techniques (often referred to as a hybrid approach). In the recent literature, all three approaches are used: TTO only [34], DCE only [35] or both [36,37,38,39]. As it remains unknown if one clearly outperforms the other, we deemed it safest to have both of them impact the value set (which necessarily worsens the model fit). Therefore, there are two dependent variables: the reported utility of a state (for TTO) and the choice made from a pair of states (for DCE). The states’ dimensions are taken as independent variables. In the process of constructing the final model, several specifications were tested: the choices were based on statistical criteria, pragmatic reasons (what the estimation results are used for) or our beliefs concerning how the elicitation tasks work.

In the estimation process, we used a Bayesian approach [40], as we find it more intuitive and flexible to work with a code (JAGS model run from within R, the code in ESM 2) directly describing the data generation process. To let the data speak, we used non-informative priors. In the estimation, we used a Markov-chain Monte Carlo simulation with, respectively, 2000, 30,000 and 20,000 adaptive, burn-in and actual iterations (2000, 20,000 and 10,000 for the intermediate models), no thinning and four chains. The medians of posterior distributions were used as point estimates, and 2.5 and 97.5 percentiles to construct 95% credible intervals. The model fit was assessed based on deviance and penalised deviance (deviance information criterion [DIC]). Potential scale reduction factors were monitored to diagnose convergence for individual parameters [41].

We only used main effects, i.e. no interactions between dimensions. This was a pragmatic decision, undertaken to ensure the final model may also be useful when only partial information is available (e.g. marginal distributions of levels for each dimension separately) [42]; for similar reasons, models with no constant term were preferred (also supported by results).

We tested (and utilised in the final model) the random parameters approach: the disutilities of dimensions/levels differ between individuals. Not only do we find this assumption intuitive but in addition the usefulness of random parameters (and the choice of specific distribution) was confirmed by DIC. Nevertheless, to limit random noise and the number of parameters, and also to avoid technical assumptions (the logical ordering of levels), we assumed it is the importance of each dimension (the disutility of level 5) that is distinctive for each individual, while the relative importance of each level is fixed across individuals (somewhat resembling the idea of simplifying how relative level importance is modelled [43]).

It is not possible in TTO to report a utility lower than − 1. Hence, we tested (and used in the final model) censoring: the observed −1s are treated as ≤−1. Some authors use censoring at 0 (where TTO is changed for lead time TTO) or at 1 (in TTO, a value greater than 1 cannot be reported) [38], which we find unconvincing. Regarding censoring at 0, negative values are possible in the protocol used, and modelling an endogenous self-censoring process would require assumptions (is a given zero the true utility or the effect of censoring?). Being unable to decide if a state is worse than dead is not equivalent to being unable to report <0 utility. Regarding censoring at 1, values above 1 are impossible, not only owing to the protocol but also because of the logical construction of the descriptive system and how the utility values were normalised.

Typically (and in our dataset), there is more variability in responses to more severe states (with lower utility, on average). This may be explained by the random parameters approach, as used in the present paper. Nonetheless, we find it plausible that for a given individual (the importance of dimensions known) there is an additional error term in TTO responses, and that this error tends to be larger for more severe health states (intuitively, for a state whose true utility for a given individual is close to 1, there is little room for a larger error). Therefore, we assumed that the scale parameter of the distribution increases with the theoretical disutility. Specifically, we used a generalised t-Student distribution with the scale and the number of degrees of freedom treated as parameters, allowing for fat tails (but also having a normal distribution as an asymptote).

In the DCE part, we assumed the probability of one state being chosen is a function of the difference in utilities, as is typically done. In the standard approach, this dependence is given by the cumulative distribution function of the logit distribution. Instead, based on the previous findings [44] and the DIC, we used the Cauchy distribution.

Previous research suggests that people with religious beliefs may misrepresent their preferences in TTO tasks, owing to an unwillingness to trade life-years—interpreted as a reporting bias, rather than a difference in preference [45]. For this reason, we introduced a parameter that scaled down the disutilities for religious respondents (separately for TTO and DCE), to disentangle the underlying and the reported preferences. In the final model, the scaling was not found in the DCE part, confirming the above interpretation.

2.4.2 Intermediate Models

We constructed several models sequentially, introducing additional building blocks in succession, and controlling for the DIC improvement, potential scale reduction factors and for whether the 95% credible interval contained a neutral value (i.e. a form of statistical significance). In this paper, we present the results of some of the intermediate steps (all based solely on TTO data):

  • M1—panel random-effects approach, with heteroscedasticity-robust standard errors;

  • M2—fixed parameters Bayesian model, with no constant term;

  • M3—random parameters Bayesian model;

  • M4—as M3, with error depending on the theoretical disutility via a t-Student distribution;

  • M5—as M4, with scaling as a result of religiosity.

We decided not to present the intermediate steps of the DCE-only part, as the parameters would require some anchoring (for more details on this issue, see [46]). However, as in the DCE part, we monitored the impact of modelling assumptions on DIC.

2.5 Value Set Comparison

There are three EQ-5D value sets available for Poland: EQ-5D-3L [16], EQ-5D-5L mapping-based cross-walk [22] and the present, directly measured EQ-5D-5L value set. To compare the utility values, we used three methods. First, we estimated the kernel density function of the utility values. Second, we identified the median and the worst levels between the EQ-5D-3L and EQ-5D-5L systems and we presented the utilities for all states. In the ESM 1, Online Resource 6, we additionally present the scatter plot to illustrate the relationship between the EQ-5D-5L value set and the other value sets.

3 Results

3.1 Sample Characteristics

From June to October 2016, 15 interviewers conducted 1570 interviews. The mean interview time was 41.1 minutes. In total, 2.3% of interviews were flagged. After excluding interviews with experimental TTO blocks (the results of the study will be reported elsewhere), 29 flagged interviews and six interviews that failed to meet the minimum quality criteria, data from 1252 respondents (52.5% female) aged 18–91 years (mean 46.2; standard deviation 17.6) were available (Table 1).

Table 1 General characteristics of respondents

The sample was representative of the Polish population in terms of age, sex, educational background, employment status, size and geographical location of the place of residence (Fig. 1). It was also similar to the Polish population in terms of health as measured by the EQ-5D-5L descriptive system, EQ visual analog scale and SF-1 [20].

Fig. 1
figure 1

Geographical representation of respondents in the Polish EQ-5D-5L valuation study

3.2 Data Characteristics

In total, 12,520 individual TTO valuations were available, with a mean number of 250 (standard deviation 6.7) observations per mild health state (misery index 6) and a mean number of 125 (standard deviation 5.3) observations for other 80 health states. In TTO, in 10.7% of the experiments, the time was not traded, and eight respondents (0.6%) did not trade for any state (an additional four respondents valued all the states at the same level, in each case with a utility of 0.95). In 1552 (13.5%) experiments, the valuations were considered worse than dead. In 271 (2.4%) and 784 (6.1%) cases, a utility of 0 and − 1 was reported, respectively. The average utility of the 55555 state in TTO was − 0.408 (33.5% at − 1). ESM 1, Online Resource 3 and Fig. 2 for the observed TTO values.

Fig. 2
figure 2

Distribution of observed time trade-off (TTO) values

In total, 1040 health states (8.3%) were indicated by the respondents in the feedback module as not revealing their true preferences in hindsight and were removed, leaving 11,480 TTO valuations for modelling. Using the feedback module reduced the number of respondents with inconsistencies related to health state 55555, from an initial 49 (3.9%) to 22 (1.8%). In the DCE data (8764 DCE pairs), 36 respondents (2.9%) presented with a suspicious response pattern (choosing states on the left or right or regularly alternately) but were not excluded from the modelling.

3.3 Preferred Model (Polish EQ-5D-5L Value Set)

In the final model, the estimated decrease of utility for level 5 amounts to: 0.314 (MO), 0.264 (SC), 0.205 (UA), 0.575 (PD) and 0.232 (AD). For example, the relative weights of levels 2–4 in UA are: 0.112 (i.e. 0.023/0.205 in Table 2), 0.195 and 0.471, while in PD: 0.052, 0.087 and 0.455. The intermediate models differ slightly (Table 2). The disutilities increase when accounting for the impact of religion and censoring (both motivated by statistical criteria). In the final value set, we get u(22222) = 0.873, u(33333) = 0.800, u(44444) = 0.296, and u(55555) = − 0.590, as compared to u(22222) = 0.716 and u(33333) = − 0.523 in the Polish EQ-5D-3L tariff. We present the complete results, alongside more technical parameters, in ESM 1, Online Resource 4, a practical example of how to use a scoring algorithm to estimate the value for a health state in ESM 1, Online Resource 5 and all 3125 values for the Polish EQ-5D-5L value set, as well as an index calculator, in ESM 3.

Table 2 Modelling results

3.4 Comparison of Polish Value Sets

The kernel density plots (Fig. 3) and the utility values for the individual states (Fig. 4) illustrate the high degree of similarity between the three Polish value sets. The new descriptive system is also more sensitive to a slight worsening in health: in the results, there are more utility values close to 1. In all likelihood, because of a higher number of health states, the distribution of the utility value is unimodal for EQ-5D-5L, while bimodal for EQ-5D-3L (though this comparison is made within the domain of health states, not of individuals, and thus it is a property of the descriptive system more than the value set itself). The new value set has a slightly lower worst utility (− 0.590; − 0.523, for both the EQ-5D-3L and the cross-walk), which is intuitive in view of the five-level system, but also in terms of accounting for the censoring and the bias from religious respondents in the present modelling.

Fig. 3
figure 3

Kernel density functions for the three Polish value sets (EQ-5D-5L directly measured is indicated by a solid line; EQ-5D-5L cross-walk is indicated by a dashed line; EQ-5D-3L is indicated by a dotted line)

Fig. 4
figure 4

Utility values for all states from three Polish value sets ordered by EQ-5D-5L (EQ-5D-5L directly measured is indicated by a solid line; EQ-5D-5L cross-walk is indicated by a solid light grey line, EQ-5D-3L is indicated by dots)

The importance of the dimensions scarcely changed: PD, MO and SC are the most significant, followed by AD and UA in the present value set, whereas by UA and AD in the previous two value sets. In Table 3, additional descriptive statistics are presented. Importantly, some of them are primarily influenced by the descriptive system, rather than the value set (e.g. the percentage of states with negative utility depends on the utilities assigned to health states, and also on how many severe health states are present in a descriptive system).

Table 3 Comparison of three Polish EQ-5D value sets

4 Discussion

In this study, we followed an official EQ-VT protocol, performed over 1200 computer-assisted face-to-face interviews, collected TTO values for 86 EQ-5D-5L health states and DCE choices for 196 pairs of states, and estimated the Polish EQ-5D-5L value set using both elicitation tasks. Our final model accounted for random parameters (respondent heterogeneity), error scaling (greater noise for more severe states), censoring at − 1, unwillingness to trade in TTO by religious participants and non-logit distribution in DCE. All these elements of the model were added in response to the statistical considerations. To the best of our knowledge, two elements are novel: the impact of religiosity and error scaling. We find the latter one rather intuitive; the variance of noise increasing with severity may partially explain why there is a weak relationship between the misery index and the disutility for the negative utility values [47]. The former element is probably the most controversial assumption in our model, and our decision to use it followed the reasoning presented in [45]. It is important to stress that correcting for the impact of religiosity does not aim at neglecting the preferences of religious individuals, but at correcting for how they may be biased in the TTO task (and how the elicitation task differs from what the resulting utilities are used for; not to actually shorten an individual’s life but to trade-off benefits between different individuals).

There are two more arbitrary decisions we made in the modelling. First, we decided to combine TTO and DCE data. We believe that provided there is no consensus on whether one method is clearly better (not in terms of cost or ease of application but the quality of the results) using both is the safest approach. Second, we decided to use a simple model with no constant and no interaction terms. As mentioned above, that makes the final results more applicable to situations where only limited information is available (e.g. only marginal distributions of levels in individual dimensions). To represent respondents’ answers more accurately (in the sense of predictive validity), a more complex model would probably have to be used (e.g. accounting for a non-linear time preference [44]). In this sense, there is a trade-off between trying to represent the data faithfully and using a specification that can be subsequently easily used.

The assumptions resulted in the theoretical value of u(55555) = − 0.590, visibly lower than the average utility elicited in TTO, i.e. − 0.408. This difference stems from three elements of our model. First, censoring leads to interpreting observed − 1s as effectively possibly much lower than − 1 (33.5% of TTO tasks for 55555 ended by assigning − 1). Second, introducing the impact of religiosity in TTO tasks results in effectively assuming that the true disutility is larger than the observed one. Third, by considering the random noise as having a larger variance for severe states, we make the parameters less driven by the actual observations for the severe states. Nevertheless, the final utility for the pit state is similar to the one in the EQ-5D-3L value set (hence, the cross-walk), and the slight decrease is intuitive in view of the larger number of levels.

Regarding the final value set, despite the fact that it describes significantly more possible health states (3125 vs. 243), it is similar to the Polish EQ-5D-3L value set in terms of a minimum utility, the range of values and the order of three most important dimensions [16]. The resemblance between the general characteristics of both value sets should support the comparability of Health State Utility Values obtained with these two types of EQ-5D questionnaire, and consequently the comparability of the results from economic analyses and the reimbursement decisions made, what was questioned in some other countries, such as the UK [48, 49]. What differentiates our study from the previous Polish valuation is greater attention to sampling, which resulted in a study group similar to Polish society as a whole, in terms of a higher number of demographic features (geographical spread in the first instance, but also employment status and size of locality).

In similarity to some other EQ-5D-5L valuations performed in developed countries, we noted the relative increase in the importance of the anxiety/depression dimension, in comparison to former EQ-5D-3L valuation studies. We suppose that this is primarily a consequence of a change in health state preferences over the period of one or two decades separating the valuation studies, rather than the effect of different wording in the EQ-5D-5L questionnaire. We may observe this phenomenon in England, Germany, the Netherlands, Spain and Japan [34, 36,37,38, 50, 51], whereas in lower income countries, such as Uruguay or The Philippines, anxiety/depression remains the least important domain [52, 53]. In addition to this observation, the predominance of the mobility dimension in Asian countries (China, Hong Kong, Indonesia, Japan, South Korea and Thailand) merits further investigation [39, 54,55,56,57]. Some changes in the dimension weightings may also be subject to change in the descriptive system: in the Polish version, the wording for mobility has been changed from ‘confined to bed’ to ‘extreme problems’.

Taking into account the number of CEE countries (20) and the relatively low gross domestic product these countries have, the objective of searching for simpler and inexpensive valuation protocols acquires further significance. Discrete choice experiment-based valuations performed online constitute a potential solution, although certain methodological challenges still have to be dealt with [58, 59]. In the meantime, researchers from the CEE region frequently face the dilemma: ‘what EQ-5D value set should I use in the absence of a national value set?’ According to the results of the recent review, in the case of EQ-5D-3L, CEE researchers mostly prefer the UK Measurement and Valuation of Health study tariff [60, 61]. In the case of EQ-5D-5L, the choice will be harder, as the EQ-5D-5L value set for England has faced criticism and is still not supported by the National Institute for Health and Care Excellence [48]. Slovenian researchers may use the cross-walk approach based on their visual analog scale-based EQ-5D-3L value set, but recommendations for scientists from other CEE countries are far from straightforward [21]. Nevertheless, they should at least consider using either the Polish or the forthcoming Hungarian EQ-5D-5L value sets, as CEE countries share some common cultural and historical background.

5 Conclusions

We developed the TTO and DCE-based EQ-5D-5L value set for Poland. It will complement the existing Polish EQ-5D-3L value set and will further stimulate the development of local quality-of-life research and the use of health technology assessment in decision making within the healthcare sector. While the new EQ-5D-5L value set offers more sensitivity, it also provides ground for consistency of past and future decisions. The presented EQ-5D-5L value set may be considered as an option by researchers from CEE countries who lack their own national health preference data.