Respondents
A representative sample of at least 1000 respondents aged 17 and over was recruited from the Indonesian general population. The sample size was based on the work of Ramos-Goñi et al.: to obtain a standard error (SE) of 0.01 for the observed mean composite time trade-off (C-TTO) value, 9735 C-TTO responses were needed. The 1000 respondents interviewed therefore provided 10,000 C-TTO and 7000 discrete choice responses to estimate the models [23]. The adult population was defined as aged 17 and over because, in Indonesia, 17 is the legal age to obtain an ID card, hold a driving license, and vote. To ensure the representativeness of the final sample for the Indonesian general population, we used a multi-stage stratified quota method with respect to residence (urban/rural, as registered by the official national register); gender (male/female); age (17–30/31–50/>50 years); and level of education: basic (primary school and below), middle (primary school plus at least 1 year of high school) and high (all others). This resulted in a first stage of 36 quota groups. Two other categories, religion (Islam/Christian/Others) and ethnicity (own-declared ethnicity: Jawa/Sunda/Sumatera/Sulawesi/Madura-Bali/Others), were considered important as well. However, including them in the same way as residence, gender, age, and education would result in 36 × 3 × 6 = 648 quota groups. We therefore applied the religion and ethnicity quotas independently of the other factors. Religion and ethnicity are thus representative over the whole sample, but not necessarily within each of the 36 quota groups. To reflect this second layer of sampling, we call the method a ‘multi-stage stratified quota’. The predefined quotas were based on updated data from the Indonesian Bureau of Statistics [1].
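The quota arithmetic can be checked with a short enumeration. This is only an illustrative sketch: the category labels are copied from the text, while the variable names are ours and not part of the study's recruitment tool.

```python
from itertools import product

# Stratification factors as listed in the text (labels are illustrative).
residence = ["urban", "rural"]
gender = ["male", "female"]
age = ["17-30", "31-50", ">50"]
education = ["basic", "middle", "high"]
religion = ["Islam", "Christian", "Others"]
ethnicity = ["Jawa", "Sunda", "Sumatera", "Sulawesi", "Madura-Bali", "Others"]

# First-stage quota groups: every combination of the four main factors.
first_stage = list(product(residence, gender, age, education))

# Number of groups if religion and ethnicity were crossed in as well.
fully_crossed = len(first_stage) * len(religion) * len(ethnicity)

print(len(first_stage), fully_crossed)
```

Crossing all six factors multiplies the number of cells by 18, which is why religion and ethnicity were handled as independent marginal quotas.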
We designed and used an online tool to ensure that respondents were recruited in accordance with the predefined quotas while sampling took place in different parts of the country. Interviews were conducted in the following six cities and their surroundings, located in different parts of Indonesia: Jakarta, Bandung, Jogjakarta, Surabaya, Medan, and Makassar. Respondents were recruited through a mixed strategy, i.e. through personal contact, local leader assistance, and from public places such as mosques and shopping streets. We also asked respondents to introduce us to other potential respondents. Interviews were conducted at the respondents’ or interviewers’ homes. For their participation, all respondents received a mug or a t-shirt specifically designed for the valuation study. Informed consent was obtained from all respondents included in the study. The study was approved by the Health Research Ethics Committee, Faculty of Medicine, Padjadjaran University, Indonesia.
Instruments
EQ-5D-5L
We used the official EQ-5D-5L Bahasa Indonesia version provided by the EuroQol Group. This translation of EQ-5D-5L was produced using a standardized translation protocol that followed international recommendations [24]. As briefly mentioned in the introduction, EQ-5D-5L consists of five dimensions: mobility (MO), self-care (SC), usual activities (UA), pain/discomfort (PD), and anxiety/depression (AD). Each dimension has five levels: no problems, slight problems, moderate problems, severe problems, and unable/extreme problems [12]. The EQ-5D-5L instrument describes 3125 (5^5) unique health states. A 1-digit number expresses the level selected for each dimension, so the five digits for the five dimensions together form a 5-digit number that describes a specific health state. For example, state ‘11111’ indicates ‘no problems on any of the five dimensions’, while state ‘54321’ indicates ‘unable to walk about, severe problems washing or dressing, moderate problems doing usual activities, slight pain or discomfort, and no anxiety or depression’ [12]. Each health state has a so-called ‘sum score of the level digits’, which is the sum of the levels across dimensions; for example, the sum score of the level digits is 5 for ‘11111’ and 15 for ‘54321’. This EQ-5D descriptive system is followed by self-rating of overall health status on a visual analogue scale (EQ VAS) ranging from 0 (‘worst health you can imagine’) to 100 (‘best health you can imagine’).
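The state coding and the ‘sum score of the level digits’ can be illustrated with a minimal sketch. The function names are ours and purely illustrative; only the coding rules come from the description above.

```python
from itertools import product

# Dimension order used in the 5-digit state code.
DIMENSIONS = ("MO", "SC", "UA", "PD", "AD")

def all_states():
    """Enumerate every EQ-5D-5L state as a 5-character string of levels 1-5."""
    return ["".join(str(level) for level in levels)
            for levels in product(range(1, 6), repeat=len(DIMENSIONS))]

def level_sum_score(state):
    """Return the sum of the five level digits of a state such as '54321'."""
    if len(state) != 5 or any(ch not in "12345" for ch in state):
        raise ValueError(f"not a valid EQ-5D-5L state: {state!r}")
    return sum(int(ch) for ch in state)

print(len(all_states()))
print(level_sum_score("11111"), level_sum_score("54321"))
```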
Valuation Protocol
The EQ-5D-5L valuation protocol consists of six sections [19]:
1. A general welcome, where the interviewer explains the objectives of the research, followed by filling in the informed consent when the individuals agree to participate.
2. Introduction to and completion of the descriptive system, VAS and background questions (age, sex, experience of illness, religion, ethnicity and education).
3. C-TTO (see Sect. 2.2.3 below) tasks followed by a ‘Feedback Module’ task. Each respondent has to complete one example (health state: being in a wheelchair), three practice health states (mild: ‘21121’; severe: ‘35554’; and moderate but difficult to imagine: ‘15411’) and ten ‘real’ C-TTO tasks valuing hypothetical EQ-5D-5L health states. In the Feedback Module task, the respondents check whether they agree with the order of the health states they valued before. The EQ-VT screen shows health states for 10 C-TTO tasks arranged based on their value given by the respondents: from the lowest value at the bottom to the highest value at the top. Respondents are allowed to ‘flag’ the health state(s) for which they do not agree with the previously given relative position to other health states, but they are not allowed to alter their initial values. Three debriefing questions regarding the difficulties of the C-TTO tasks are added at the end of this section.
4. A discrete choice experiment (DCE, see Sect. 2.2.3 below) followed by three debriefing questions regarding the DCE. Each respondent has to complete seven forced-pair comparisons.
5. A round-up, where respondents can comment on the valuation tasks.
6. Country-specific questionnaire(s) (if any).
All sections were administered utilizing computer-assisted face-to-face interviews employing the EQ-VT platform version 2.0.
Preference Elicitation Methods
Time trade-off (TTO) has been widely used as a standard method to elicit preferences [25, 26]. C-TTO uses conventional TTO to elicit better-than-dead (BTD) values, and lead-time TTO to elicit worse-than-dead (WTD) values. Details regarding C-TTO can be found in the study by Oppe et al. [27]. In summary, respondents were first faced with ‘conventional’ TTO where they had to choose between 10 years in an impaired health state (Life B) and 10 years of full health (Life A). After a series of choice-based iterations, respondents achieved a point of equivalence between the length of time in full health (Life A): ‘x’ and a period of time (10 years) in the impaired health state (Life B). The impaired health state value is defined as x/10. For example, if a respondent could not differentiate between 3 years of full health in Life A and 10 years living in Life B, then that health state value would be 0.3 (3/10). For a really poor health state, respondents might prefer to die immediately; that is, the value for that specific health state is <0 (death value = 0). In this case, the lead-time TTO approach was introduced to allow respondents to express a value below the value of death; that is, below 0. The two lives in the lead-time TTO are 10 years of full health (Life A) and 10 years of full health followed by 10 years in the impaired health state (Life B). When respondents reach an indifference point between the amount of time ‘x’ in Life A and Life B, the health state value is defined as (x − 10)/10. Hence, −1 is the lowest possible value of a given health state, generated from trading the full 10 years of Life A in a lead-time TTO.
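The two value formulas can be written out as a small function. This is a sketch of the arithmetic described above only; the function name and the `lead_time` argument are illustrative and do not correspond to the EQ-VT software.

```python
def ctto_value(x_years, lead_time):
    """Convert the indifference point x (years in Life A) into a C-TTO value.

    lead_time=False: conventional TTO for better-than-dead states, value = x / 10.
    lead_time=True:  lead-time TTO for worse-than-dead states, value = (x - 10) / 10.
    """
    if not 0 <= x_years <= 10:
        raise ValueError("x must lie between 0 and 10 years")
    if lead_time:
        return (x_years - 10) / 10
    return x_years / 10

# Indifference at 3 years of full health in the conventional task.
print(ctto_value(3, lead_time=False))
# Trading all 10 years of Life A in the lead-time task gives the floor value.
print(ctto_value(0, lead_time=True))
```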
The EQ-5D-5L valuation protocol included 86 EQ-5D-5L health states to be valued using C-TTO. The 86 health states were distributed into ten blocks with a similar level of severity. The set comprised 80 unique health states selected using Monte Carlo simulation (eight included in each block), five very mild states (only one dimension at level 2 and all others at level 1, e.g. ‘11112’; each included in two blocks) and the most severe (‘pits’) state ‘55555’ (included in all blocks) [19]. Respondents were randomly assigned to one of the ten C-TTO blocks. The states in each block were presented to respondents in random order using the EQ-VT platform.
However, it was realized that TTO has its limitations. The EuroQol Group considered different valuation techniques to be used in conjunction with TTO to make the valuation studies more robust and valid. Previous experiments with DCE, like the study by Stolk et al. using EQ-5D-3L [28] or Ramos-Goñi et al. using EQ-5D-5L [29], showed that the DCE is a valid valuation technique for eliciting health preferences from respondents. Since both TTO and DCE try to measure the same concept, it was anticipated that DCE could be used in combination with TTO [30]. In the light of this reasoning, DCE was included in the EuroQol VT protocol.
Each DCE task was conducted by presenting two health states and asking the respondent to select the preferred state for him/her. The DCE design consisted of 196 pairs of EQ-5D-5L health states distributed over 28 blocks, each consisting of seven pairs with a similar severity [19]. The seven paired comparisons were presented in random order by the EQ-VT; in addition, the right–left order of the two health states presented was also randomized.
Data Collection
At the outset, 13 interviewers were recruited and trained intensively in a 1-day workshop at two locations: (1) Jakarta for interviewers who worked in Jakarta, Bandung and Makassar; and (2) Jogjakarta for interviewers who worked in Jogjakarta, Surabaya and Medan. Each interviewer performed at least five pilot interviews in the week after training. Their experiences were discussed and feedback was given by the daily supervisor. Only after this were they permitted to conduct real data interviews. Three additional interviewers were hired during the data collection; they received similar training and met similar requirements to the first 13. Interviews were performed between March 9, 2015 and January 24, 2016. After 102 interviews we evaluated the quality of the interviews (see Sect. 2.5 below) and concluded that it was not yet sufficient. Hence we retrained the interviewers and treated the 102 interviews collected thus far as pilot interviews, excluding them from the data analysis. A detailed description of this decision-making process and the retraining of the interviewers is provided elsewhere [30].
Exclusion Criteria
There were two main criteria for data exclusion: lack of completion of an interview and characteristics of respondents’ answers that related to poor understanding of the task or to errors [31]. Note that the first criterion concerns excluding respondents and the second excludes respondent answers/responses.
With respect to the first criterion, interviews were excluded when respondents did not finish the interview for the following reasons: (1) the respondent indicated that he/she did not want to continue the interview process, (2) interviewers concluded that the respondent was unable to differentiate between the different dimensions and levels of EQ-5D-5L, (3) interviewers concluded that the respondent was not able to comprehend the C-TTO task during the practice session. When an interview had to be stopped during the C-TTO task it was excluded from the study.
With respect to the second criterion, completed interview responses were excluded on account of any of the following characteristics: (1) a respondent had a positive slope on the regression between his/her values on C-TTO and the ‘sum score of the level digits’, as this would indicate that the respondent provided higher utility values for poorer health states on average—the slope of the regression between C-TTO and the ‘sum score of the level digits’ was generated as part of the standard quality control report; (2) when a response in the C-TTO tasks was judged to be irrational: for instance, preferring life B (10 years in the corresponding health state) to life A (10 years in full health) and not shifting after his/her initial response was reconfirmed by the interviewer; (3) responses that were marked by the respondents in the Feedback Module task, which was a sign that the respondents disagreed with the valuation of those responses.
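The first of these criteria can be sketched as an ordinary least-squares slope of a respondent's C-TTO values on the sum score of the level digits. The code below is an assumption-laden illustration: the standard quality control report computes this slope itself, and the function names and example values here are hypothetical.

```python
def level_sum_score(state):
    """Sum of the five level digits of an EQ-5D-5L state such as '21232'."""
    return sum(int(ch) for ch in state)

def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all sum scores are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def has_positive_slope(responses):
    """responses maps each valued state to the respondent's C-TTO value."""
    xs = [level_sum_score(state) for state in responses]
    ys = list(responses.values())
    return ols_slope(xs, ys) > 0

# Hypothetical respondent who values worse states lower (not excluded).
consistent = {"11112": 0.95, "21232": 0.60, "55555": -0.50}
# Hypothetical respondent who values worse states higher (excluded).
reversed_order = {"11112": -0.50, "21232": 0.60, "55555": 0.95}
print(has_positive_slope(consistent), has_positive_slope(reversed_order))
```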
Quality Control
To ensure data quality, we followed the quality control (QC) process described by Ramos-Goñi et al. [32], which consisted of minimum quality criteria and cyclical feedback to improve interviewers’ skills. The EuroQol Group facilitates use of the EQ-VT QC tool, which is a software programme that automates the production of QC reports based on data from EQ-VT studies. Bi-weekly meetings (teleconference-based) were organized to discuss the QC reports with the EQ-VT support team. The aim of these meetings was to evaluate and improve the interviewers’ performance and to check for possible non-compliance to the interview protocol.
Minimum Quality Criteria
The QC reports provided a number of statistics related to the quality of the data collected thus far, differentiated by interviewer.
1. Wheelchair time: when the duration of time an interviewer used to explain the ‘wheelchair example’ preceding the actual C-TTO tasks was <3 min.
2. Wheelchair lead-time: when the interviewer did not explain the WTD element of the wheelchair example.
3. C-TTO duration: if completing the ten C-TTO tasks took <5 min.
4. Inconsistency: the value for state ‘55555’ was not the lowest and it was at least 0.5 higher than that of the state with the lowest value.
If any of the four above-mentioned signs is observed, the interview is ‘flagged’ as being of suspicious quality. If four or more of an interviewer's ten interviews are flagged, all ten interviews conducted thus far by that interviewer are removed and the interviewer is retrained. After a further ten interviews, performance and compliance are re-evaluated. If four or more interviews are flagged again, this set of ten interviews is also removed and the interviewer is removed from the data collection process. Quality control focused on the interviewer; responses in flagged interviews were not removed from the data that were analysed.
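The flagging rule and the four-in-ten threshold can be sketched as follows. This is not the EQ-VT QC tool; the `Interview` record and its field names are hypothetical and only mirror the four criteria listed above.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Interview:
    wheelchair_minutes: float   # time spent explaining the wheelchair example
    explained_lead_time: bool   # WTD part of the wheelchair example was shown
    ctto_minutes: float         # time spent on the ten C-TTO tasks
    values: Dict[str, float]    # C-TTO value given to each health state

def is_flagged(interview):
    """Flag an interview if any of the four minimum quality criteria is met."""
    lowest = min(interview.values.values())
    pits = interview.values["55555"]
    return (
        interview.wheelchair_minutes < 3
        or not interview.explained_lead_time
        or interview.ctto_minutes < 5
        or pits - lowest >= 0.5  # '55555' valued well above the lowest state
    )

def needs_retraining(batch):
    """Apply the threshold of four or more flagged interviews in a batch of ten."""
    return sum(is_flagged(interview) for interview in batch) >= 4

good = Interview(4.0, True, 12.0, {"55555": -0.5, "11112": 0.95})
rushed = Interview(2.0, True, 12.0, {"55555": -0.5, "11112": 0.95})
print(needs_retraining([rushed] * 4 + [good] * 6))
```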
The DCE part of the valuation study was also monitored to detect suspicious response patterns. Assuming that A is the health state at the left of the screen and B is the health state at the right of the screen, then a consistent preference for the left (A) would be suspicious (AAAAAAA). The same would apply for the response pattern BBBBBBB, ABABABA, BABABAB. This was also reported in the QC report.
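A check for the four suspicious DCE patterns named above is straightforward. The sketch below assumes a respondent's seven choices are recorded as a string of ‘A’ and ‘B’ in presentation order; that encoding and the function name are ours.

```python
# The four response patterns described as suspicious in the text.
SUSPICIOUS_PATTERNS = {"AAAAAAA", "BBBBBBB", "ABABABA", "BABABAB"}

def is_suspicious_dce(choices):
    """Return True if the seven left/right choices match a suspicious pattern."""
    if len(choices) != 7 or set(choices) - {"A", "B"}:
        raise ValueError("expected seven choices coded as 'A' or 'B'")
    return choices in SUSPICIOUS_PATTERNS

print(is_suspicious_dce("AAAAAAA"), is_suspicious_dce("ABBABAA"))
```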
Cyclical Feedback
The retraining programme conducted by the daily supervisor was held in 2 locations: (1) Jakarta for interviewers who worked in Jakarta, Bandung and Makassar; and (2) Jogjakarta for interviewers who worked in Jogjakarta, Surabaya and Medan. The QC reports for their interviews were presented, discussions were held to address non-compliance problems, and suitable solutions were agreed upon among the interviewers. After the retraining programme, the daily supervisor continuously created QC reports, made notes at the group and individual levels, and sent feedback to the interviewers, so that they were able to learn from their own and other interviewers’ performance.
Data Analysis
We describe the sample characteristics including self-reported health on the EQ-5D-5L descriptive system and the EQ-VAS using percentages for discrete variables and means and standard deviations for continuous variables in comparison with the Indonesian population. A general z test was used to investigate whether the proportions in the sample were similar to, or different from, the general population.
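The text does not specify the exact form of the z test. One plausible reading, shown here purely as an assumption, is a one-sample z test that compares a sample proportion with a known population proportion; the example counts are hypothetical.

```python
from math import erfc, sqrt

def one_sample_proportion_z(successes, n, p0):
    """Two-sided one-sample z test of a sample proportion against p0.

    Returns the z statistic and the two-sided p value.
    """
    if n <= 0 or not 0 < p0 < 1:
        raise ValueError("need n > 0 and 0 < p0 < 1")
    p_hat = successes / n
    standard_error = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    p_value = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical example: 550 of 1000 respondents in a category that makes up
# 50% of the population.
z, p = one_sample_proportion_z(550, 1000, 0.50)
print(round(z, 3), p < 0.05)
```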
In this investigation we used TTO (specifically C-TTO) and DCE. TTO has limitations such as loss aversion [33], but also has advantages, as TTO-based value sets are anchored on a scale of (0) death to (1) full health. DCE is not exempt from limitations, as lexicographic behaviour from respondents has been widely reported in the literature [34]. It should also be noted that DCE in its present form, where time is not incorporated in health state presentations, does not anchor value sets on a (0) death to (1) full health scale. Therefore, DCE produces value sets on an arbitrary scale based on the relative distances between health states.
Both techniques attempt to measure health state preferences, but they rely on different underlying assumptions and do not seem to share the same limitations. Therefore, the data obtained from these two elicitation methods can be seen as complementary rather than competing. Hence, we chose the solution presented by Oppe and van Hout [35], who combined DCE with C-TTO in a ‘hybrid model’, imposing the (0) death to (1) full health scale as determined by C-TTO.
To illustrate how the hybrid model combined C-TTO and DCE responses in this study, we also present the results from the models estimated from each C-TTO and DCE separately, with the same assumptions as those used for the hybrid model. We used the 20-parameter main effects model, which estimates four parameters for the five levels of each of the five dimensions: mobility, self-care, usual activities, pain/discomfort and anxiety/depression. Each coefficient represents the additional utility decrement of moving from one level to another. Hence, the overall decrement of moving from ‘no problems’ to ‘unable/extreme problems’ is calculated as the sum of the coefficients of ‘no problems to slight problems’, ‘slight problems to moderate problems’, ‘moderate problems to severe problems’, and ‘severe problems to unable/extreme problems’.
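The way incremental coefficients add up to a predicted value can be sketched as below. The coefficients are HYPOTHETICAL placeholders chosen for illustration and are not estimates from this study; the sketch also assumes no intercept, so that state ‘11111’ has a value of 1.

```python
DIMENSIONS = ("MO", "SC", "UA", "PD", "AD")

# HYPOTHETICAL incremental decrements per dimension, in the order
# level 1->2, 2->3, 3->4, 4->5. These are NOT the study's estimates.
HYPOTHETICAL_INCREMENTS = {
    "MO": (0.05, 0.03, 0.10, 0.12),
    "SC": (0.04, 0.02, 0.08, 0.06),
    "UA": (0.04, 0.03, 0.07, 0.06),
    "PD": (0.05, 0.04, 0.12, 0.10),
    "AD": (0.04, 0.03, 0.09, 0.08),
}

def predicted_value(state, increments=HYPOTHETICAL_INCREMENTS):
    """Value = 1 minus the summed incremental decrements up to each level."""
    decrement = 0.0
    for dimension, digit in zip(DIMENSIONS, state):
        level = int(digit)
        # Moving from level 1 to `level` accumulates the first level-1 steps.
        decrement += sum(increments[dimension][: level - 1])
    return 1.0 - decrement

print(predicted_value("11111"))
print(predicted_value("51111"))
```

With 20 such increments (four per dimension), the decrement for the worst level of a dimension is simply the sum of its four steps.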
Presenting the TTO, the DCE and the hybrid model also allows us to compare the value distribution in the form of the correlations between the predicted values of the models, and we can compare the weights of the individual dimensions. This gives information about construct validity in the form of ‘convergent validity’, or ‘concordance’.
Modelling was undertaken using the Stata statistical package. C-TTO data were modelled using the response values as dependent variables and the health states as explanatory variables. This was achieved by implementing a Tobit model (hyreg with ll() option), which assumes a latent variable Y*_it underlying the observed Y_it of C-TTO values when there is either left- or right-censoring in the dependent variable. The C-TTO data, in particular the lead-time C-TTO data for WTD health states, are by nature censored at −1 [ll(−1) option on hyreg command]. This means that the lowest value that could be observed with the C-TTO method was −1, even though the latent preferences of respondents may include values lower than −1 [36]. The Tobit model accounts for this censoring by estimating the latent variable Y*_it, which can take on predicted preference values extrapolated beyond the range of the observed values. The variance of C-TTO data is not homogeneous among health states; this led us to model the C-TTO data as heteroskedastic. We used the hetcont() option of the hyreg command as suggested by Ramos-Goñi et al. [37]. The dummy variables included in the hetcont() option were the same as those included in the main model, that is, the 20 dummies that specified the main effects model.
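For orientation, a left-censored Tobit model of this kind can be written in generic textbook form as follows, with i indexing respondents and t indexing tasks. The log-linear variance equation is an assumption made here for illustration; the exact parameterization used by hyreg's hetcont() option is not stated in the text.

```latex
\begin{aligned}
Y^{*}_{it} &= \mathbf{x}_{it}^{\top}\boldsymbol{\beta} + \varepsilon_{it},
  \qquad \varepsilon_{it} \sim N\!\left(0, \sigma_{it}^{2}\right), \\
Y_{it} &= \max\!\left(-1,\; Y^{*}_{it}\right), \\
\ln \sigma_{it} &= \mathbf{x}_{it}^{\top}\boldsymbol{\gamma}
  \quad \text{(assumed form of the heteroskedasticity equation)}.
\end{aligned}
```

Here x_it holds the 20 main-effects dummies, so the same variables enter both the mean and the variance equations, in line with the description above.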
DCE (forced pair comparisons in our case) responses were modelled as a conditional logistic regression model including the same 20 dummy parameters as those used for the C-TTO data. Nevertheless, we did not use the coefficients estimated from a conditional logit model because they were expressed on a latent arbitrary utility scale. We rescaled the DCE coefficients using the same parameter θ that was estimated in the hybrid model. This rescaling assumes that the C-TTO model coefficients are proportional to the DCE model coefficients. For more details on the modelling see Ramos-Goñi et al. [23, 37].
Pearson product-moment correlation analysis was applied to measure the strength and direction of association that exists between C-TTO, DCE rescaled and hybrid predicted values for 3125 health states.
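For reference, the Pearson product-moment coefficient can be computed as in the sketch below. The study's analysis was run in Stata; this plain-Python function and its example inputs are illustrative only.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)

# Hypothetical predicted values for three states from two models.
print(pearson_r([1.0, 0.6, -0.2], [0.9, 0.5, -0.3]))
```

In the study, the inputs would be the predicted values of the 3125 health states from each pair of models.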