Head-to-head comparison of health-state values derived by a probabilistic choice model and scores on a visual analogue scale

Background Health states were quantified based on discrete choice (DC) modeling and visual analogue scale (VAS) values using the five-level version of the EQ-5D (EQ-5D-5L). The aim of this study was to determine the extent of the relationship between DC derived values (indirect method) and VAS values (direct method). Methods Data were collected in Canada, the United Kingdom, the Netherlands, and the United States. Respondents were asked to perform paired comparisons between two EQ-5D-5L health states for DC. In total, 400 different EQ-5D-5L states were included. After each DC task, respondents were prompted to score the two states one after another on a VAS. Intraclass correlation coefficients were calculated between DC and VAS values and illuminating graphs were designed. Results Approximately 400 respondents participated from each country. High similarity [individual intraclass correlation coefficients (ICC) >0.85] of DC and moderate correspondence of VAS values were observed for the four countries. Cross-country comparison of DC values shows a nonlinear relationship to the VAS values. Conclusion EQ-5D-5L derived DC and VAS values show a close but nonlinear relationship. Given the obvious biases associated with the VAS, DC methods based on ordinal responses may be a better alternative.


Introduction
For decades, the merits and assumptions of direct methods of eliciting preferences for health states have been studied and debated. Direct valuation methods include the standard gamble (SG), time trade-off (TTO), rating scale, visual analogue scale (VAS), magnitude estimation, and person trade-off [1][2][3]. More recently, Thurstone scaling and extensions of this indirect (latent) method based on ordinal responses, namely the class of probabilistic choice models, have been explored in the field of health-status measurement [4][5][6][7]. All these valuation methods and latent measurement models are based on specific theoretical assumptions and arise from diverse disciplines. Empirical studies on the relationship between the values produced by each of these methods have revealed differences in the values elicited by the different methods and in their measurement properties [8]. So far, there is little agreement about which method is the most appropriate.
The use of different valuation methods has been of particular interest to developers of multi-attribute health classifier systems. The relationship between preferences generated by the various methods has been examined for the TTO, SG, and VAS [9,10]. The relationship between discrete choice (DC) and TTO was recently examined by Pullenayegum and Xie [11], who found high correlations (r = 0.79 to 0.86) between health values measured through TTO tasks and latent values derived from DC data. Bijlenga et al. [12] investigated the agreement between values derived with DC and VAS values, and found a Cohen's kappa of 0.79. To further understand the validity of the DC approach, this study examines the relationship between VAS-and DC-derived values and their underlying data structure using data from valuation studies for EQ-5D-5L conducted in four different countries.

Theory Probabilistic choice models
Choice modeling offers an approach to explore people's values, and has good prospects for health-state valuation. Actually, it relates to one of the oldest traditions in measuring subjective phenomena, namely the estimation of cardinal or metric measures based on ordinal responses. Thurstone's 'law of comparative judgment' provides the conceptual foundation for most means of deriving cardinal values from ordinal assessments [13]. Following Thurstone, Bradley and Terry [14], Luce [15], and McFadden [16] further developed the basis for choice methods and refined their analytic capacity. Kind [16] presented the first application of the Bradley-Terry-Luce approach to healthstate valuation.
What all probabilistic choice methods (logit or probit regression models) have in common is that they can establish the relative merit of one phenomenon (e.g., health states) with respect to others. Modern probabilistic choice models came from the field of econometrics and have been built upon the work of McFadden [17]. The models encompass a variety of experimental design methods, data collection protocols, and statistical procedures that can be used to predict the choices that subjects will make between alternatives (e.g., health states). These methods can be applied when subjects may choose between two or more distinct ('discrete') alternatives. In short, discrete choice models are grounded in modern measurement theory and are consistent with the random utility model in economic theory [18]. Interest in these methods has recently been revived in the area of health economics due to the relative simplicity of eliciting ordinal responses and the availability of a wide range of analytic tools to accommodate these responses [6,19,20].

Visual analogue scale
Visual analogue scale (VAS), a renowned direct valuation method, originated in the social sciences and has been popular among psychologists. This type of scale has a long history and was initially called 'graphic rating' [21]. Aitken and Zealley were among the first to apply a VAS in medicine, using it to construct single-item mood scales [22,23]. Ever since, the VAS has been a common research and clinical tool in psychological medicine, especially for measuring pain. Priestman and Baum [24] were probably the first to use a VAS ('linear analogue self-assessment') to assess quality of life among cancer patients. To our knowledge, Patrick and his colleagues were the first to use a VAS to derive values for hypothetical health states [25].
Essentially, a VAS (also sometimes called a 'semantic differential') is simply a straight line of a specified length with verbal descriptors at each end (anchors) consisting of short and easily understood phrases that describe the variable being measured. However, markers are often added to the line, usually with numbers attached. Formally called rating scales, these are often referred to as VAS, which for historical reasons is also the case for the EQ-VAS [26,27].
The VAS is employed by developers of preferencebased measures, such as the EuroQol Research Foundation, in several ways. The first is the conventional one in which an individual uses the VAS to indicate (e.g., to score or value) his own actual health status. Alternatively, the VAS is used to derive valuations of a set of hypothetical health states that are simultaneously assessed on one single VAS (multi-item VAS) such that a respondent evaluating outcomes A, B, and C must consider whether A is preferable to B, B preferable to C, and A preferable to C; the respondent also has to decide on the strength of these preferences.

Measurement properties
There are theoretical and methodological differences between the direct valuation method (VAS) and the indirect (latent) measurement method based on choice models. Nonetheless, for both methods we assume that individuals have implicit preferences for health states, ranging from good to bad, and preferences can be revealed and expressed or derived quantitatively. Accordingly, differences between health states should reflect the increments of difference in the severity of these states. By implication, measures should lie on a continuous (unidimensional) scale. The differences between values would reflect true differences (e.g., if a patient's score increases from 40 to 60, this increase is the same as from 70 to 90: interval level). This would mean that the values derived by different methods should have a linear relationship.

Measurement mechanisms
With the VAS, the scores are directly positioned on a continuous scale (thermometer), and are equivalent to the values that we are interested in. As such, the VAS is a direct measurement method. Yet it should be kept in mind that in daily life people rarely make absolute judgments (i.e., attach a numeric measure, as done with the VAS). Most judgments consist of choices, and are thus inherently comparative. Therefore, the core activity of founded theories and models for subjective measurement, such as the DC models, is to compare two or more entities in such a way that the data will yield compelling information. Technically speaking, these models take individual values obtained at one measurement level and transform these to an aggregated level, specifically to produce an interval scale from ordinal data. So, response data produced by exercises as input for a choice model are not very informative. It is the inference based on the statistical algorithm of a specific choice model that produces the derived values. For this reason, such methods are referred to as indirect.

Study design
A study design was developed and implemented in Canada, England, The Netherlands, and the United States (US) between September 2010 and August 2011. Values for EQ-5D-5L health states were elicited by means of time tradeoff (TTO; not presented), a choice model based on paired comparisons, and VAS. In this study, the most basic multiitem VAS was applied, whereby each time two hypothetical health states (from the paired comparison task) were scored one after the other.

EuroQol-5D-5L
The health states selected for valuation were based on the EQ-5D-5L descriptive system. The EQ-5D-5L comprises the same five dimensions as the original three-level EQ-5D, i.e., Mobility (MO), Self-Care (SC), Usual Activities (UA), Pain/Discomfort (PD), and Anxiety/Depression (AD). In the EQ-5D-5L, however, the level structure is expanded, giving each dimension five levels: no problems, slight problems, moderate problems, severe problems, and extreme problems/unable to do something [28]. On the basis of responses to the EQ-5D health-state classifier, a preference-based scoring function can be applied that generates a single value for health.

Respondents
In each of the four countries, at least 400 persons participated in the study. Representative samples from the general population (stratified by age, education, and sex) were recruited in each country with a minimum age of 18 years. Instructions and valuation tasks were presented and responses were collected within a digital setting by using a computer-assisted personal interview mode of administration: the EuroQol Valuation Technology [29]. To ensure that the valuation tasks were understood, the web-based assessments were interviewer-assisted. About three trained interviewers oversaw groups of approximately 15 respondents in six to eight sessions per day. In England, identical software was used; however, a team of eight home-based interviewers conducted the assessments in a one-to-one setting [30]. Respondents were paid a small sum by the panel administrators for completion of the survey. The exact amount, which differed across the countries, was in the range of €20 to €60.

Experimental design
A Bayesian algorithm was used to generate an efficient design consisting of 200 paired comparisons (i.e., 400 health states). Priors in the estimation algorithm were based on an earlier study [31]. The 200 paired comparisons were subdivided into 20 blocks so that each respondent would make 10 paired comparisons.

Response tasks
Respondents were asked to perform the most simple response task in the framework of choice models, namely a paired comparison between two EQ-5D health states (Fig. 1). Everyone had to make a forced choice between ten different pairs of EQ-5D-5L states. These paired comparison tasks did not include ''dead'' or duration statements. No ''status quo'' or ''opt-out'' choices were offered. In total, 200 pairs of states (400 states in all) were judged (order of the pairs and order within each pair had been completely randomized by the computer system) in each country [32]. After each paired comparison (e.g., DC task), respondents were prompted to assess the two states on a single VAS. First, they assessed the state that had been judged as the best health state in the DC task; then they assessed the other health state, whereby the placement of the first one was shown on the VAS as well. At the start of the session with the paired comparisons, an animation popped up on the screen to explain the general purpose of the task and instruct the participant on what to do. This animation also explained the VAS tasks that proceed from the paired comparison tasks.

Analysis
For the DC analysis, a multinomial probit model (alternative-specific multinomial probit, STATA) was used to analyze the paired comparison data. This is equivalent to standard approaches to analyzing paired comparison data or discrete choice data (e.g., McFadden model). The maineffects model included 20 dummy variables to represent  Added to our model is an alternative-specific constant (ASC) capturing a tendency to always choose the first option, which makes the full formula: The VAS values were aggregated to 400 EQ-5D-5L pooled states per country, whereas for the DC the 400 states presented in the paired comparison tasks were predicted by the formula. For VAS and DC, individual intraclass correlation coefficients (ICC) were calculated between values per country versus the pooled values of all four countries. Pearson's correlations, ICCs, and quadratic regression functions were calculated between VAS and DC values. The DC values were regressed on the VAS values.
Charts for all combinations of countries and their regression functions were made in SigmaPlot (version 12.0; Systat Software, San Jose, CA) to investigate differences in constant and slope. Detailed graphs were also constructed to reveal the underlying distributions between DC and VAS values.

Respondents
The number of individuals who entered the study was 547 for Canada, 404 for the UK, 407 for the Netherlands, and 417 for the US. A total of 1775 respondents completed all 17,750 paired comparisons. Age distribution was similar across the four countries, although the Netherlands had a smaller proportion of younger participants and a larger proportion of middle-aged ones ( Table 1). The mean age in the entire dataset was 40 years (SD 16), with a range of 18 to 100. Regarding gender, the differences between countries were modest. The samples closely matched the populations on these key characteristics. Additional demographic information was collected only for the US and the UK. Among US respondents, 70.8% reported that they had received education beyond high school; 65.8% were non-Hispanic white (n = 273), 17.6% African American (n = 73), and 16.6% other ethnicities. The UK sample included a larger proportion of degree-educated and employed individuals compared to the general population, but the sample was broadly representative in terms of other background characteristics, such as ethnicity [33].

Completion
The number of judgments for each separate health state in the four countries ranged from 15 to 42 (mean 22.5, SD 2.68). In the Dutch study, one block of states (block 11) was not assessed due to a programming error. The number of drop-outs (individuals not completing all of the valuation tasks) was low in all countries (ranging from UK 4 to

Health-state values
Pooled DC coefficients were all statistically significant. All showed a logical increase that corresponded with the underlying structure of the levels of the attributes ( Table 2). One exceptional finding is that the coefficient of slight pain/discomfort (PD2) was more negative than that of moderate (PD3) pain/discomfort. The average DC values per country differed similarly to the average VAS values for the set of 400 EQ-5D-5L states, showing the highest mean values over all health states in the US, followed by Canada, then by the Netherlands, with the lowest values in the UK.

Comparability of countries' health-state values
When the DC values for the 400 EQ-5D-5L health states are compared to the pooled DC value, the result is a high Pearson's correlation: almost 1.00 for Canada, 0.99 for UK, 0.99 for the Netherlands, and 0.99 for US (Fig. 2).
The VAS values for the 400 EQ-5D-5L health states showed strong correlations with the pooled VAS value: Canada 0.95, UK 0.93, the Netherlands 0.93, and US 0.93 (Fig. 3). The values for differences between severe health

Comparability of the two valuation methods
The cross-country comparison of the pooled DC values and pooled VAS values of the 400 EQ-5D-5L health states showed a strong correlation (r = 0.93). Comparisons between the individual countries showed strong correlations: Canada 0.88, UK 0.90, the Netherlands 0.86, and US 0.86 (Fig. 4). The UK values for VAS as well as DC were found to be lower than the pooled values and the other three countries' values. Graphical representation of the values obtained by the two methods in relationship to (the pairs of) states showed the following. First, mean VAS values were all positioned in a relatively small range (ca. 40-70) of the total scale (Fig. 5b). Small perceived differences between two states of a pair (e.g., close to 50% choice in favor of one of the two states in the DC task) are reflected in small differences between VAS values (Fig. 5b). Differences in values for the VAS pairs (Fig. 5b) are also clearly reflected in the derived DC values (Fig. 5a). Another correspondence between the DC and VAS was revealed: for the VAS, although respondents tend not to score on the boundaries (\30, [70), pairs of states that were scored low or high on the VAS were predicted under the DC model also as low or high. A clear example of this correspondence is depicted in Fig. 5a and b (see: two circles, two boxes). The standard deviations of the VAS values are relatively homogenous among the 200 pairs of health states that include pairs of rather modest states and pairs of rather severe states (Fig. 5c).

Discussion
The cross-country comparison of DC and VAS values demonstrated a nonlinear relationship between the two methods. The curvature in the relationship suggests that values differ more at the ends of the scale. In this case, this implies that differences between worse health states are greater under the DC model, and differences between relatively good health states are larger in the VAS. Divergence between the four countries was modest, although the UK showed a small deviation as the values of VAS, and DC showed lower values for the worse health states. However, it is hard to say whether any differences in these values are due to cultural notions, methodological differences, or to translational issues (e.g., Dutch wording may make levels 2 and 3 seem closer together than in other language versions). A limitation of the study is that the assumed independence of the VAS-and DC-derived values is reduced due to the chained nature of the task. Moreover, when the respondents positioned two health states simultaneously on a single scale (multi-item VAS), consistency with the preceding discrete choice task was required. They were offered the opportunity to redo that choice or correct their VAS score. In case two health states were compared that were very different, the chained nature would hardly affect both assessments; however, if the two states were relatively similar, differences in the VAS scores might be inflated. Overall, the multi-item VAS may be regarded as a compound task in which the prior comparison is supplemented with a level of rating, thereby enforcing congruence between methods.
The implication of the observed nonlinear relationship between DC and VAS is that the differences between health-state values obtained by one method are not proportional to the differences obtained by the other one. The curvature in the relation between DC and VAS can be partially explained by referring to theory and empirical evidence on VAS biases. Many studies have shown that the VAS is prone to diminished use of the upper and lower part of the scale (end-aversion). That propensity leads to range reduction and the typical curviness of VAS values as compared with other valuation methods [34]. This phenomenon is known from studies that compared VAS values with methods such as standard gamble and TTO [9,10]. Any nonlinear relationship indicates that one, or even both, valuation methods are producing values that do not possess interval level or cardinal measurement properties. The limited value range of the VAS observed in this study seems to confirm this phenomenon.
We are aware of the longstanding theoretical debate about whether or not interval properties can be ascribed to VAS. Economists claim that responses to the standard gamble and the TTO have interval scale properties, whereas responses to rating scales, including the VAS, tend not to have interval scale properties because no trade-offs are expressed. Attempts have been made to find empirical evidence that mean health-state values collected with a (multi-item) VAS can be characterized roughly as interval data. One study, based on a rank-based scaling method (unfolding), observed a very strong relationship that supports the interval scale property of the VAS data [35]. Confirming results were found in a study that applied nonmetric multidimensional scaling on data (metric and ranks) that were derived from VAS values [36].
On the other hand, it is well documented that the VAS is prone not only to end-aversion but also to context or reference bias. The presumed independence of the set of health states to be positioned on the VAS has been rejected in two Dutch and one Australian study [37][38][39]. These clearly showed that different values will be collected with a multi-item VAS for a fixed set of health states if these are presented along with varying other states. In addition, the choice and phrasing of the anchors leads to different results [40]. . c Standard deviations VAS values. On all three graphs, the x-axis is ordered on percentage of respondents in favor of health state A As such, the measurement procedure of the DC seems to have an advantage over VAS, as the former may be free of certain biases that seem to occur in the VAS. Moreover, DC modeling offers several attractive characteristics not found in other methods of health-state valuation. It is grounded in modern measurement theory, and the judgmental task and the analysis are executed within one unifying framework. In addition, this measurement framework can be extended to other strong measurement models, such as item response theory and structural equation modeling. Unlike the DC model, the assignment of values to health states by the VAS is not embedded in a strong theoretical measurement framework [30,[41][42][43]. An axiomatic approach called measurable value functions [44] has been described to deal with VAS-generated data. These are derived from individual responses using algebraic and deterministic axioms. Violations of theoretical predictions or conditions do not usually lead a behavioral scientist to reject the corresponding theory or hypothesis. An error theory would have to be added to these axioms to make them applicable in the behavioral and social sciences. For the VAS, however, such a probabilistic value method has never been presented. Moreover, as mentioned earlier, two pieces of empirical evidence have been brought to bear against the interpretation of VAS as a measurable value function: context bias and end-state aversion.
Given these concerns about VAS, we would suggest a potentially better alternative: to use DC methods based on ordinal responses. Furthermore, several basic (mathematical) assumptions, conditions, and requirements of the DC model warrant closer examination in the context of healthstatus measurement.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://crea tivecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.