8.1 The Three Cs: Content, Construct and Criterion Validity

While goodness-of-fit measures and prediction success may be used to assess the validity of the econometric model, broader aspects of validity of the obtained value estimates should also be assessed when conducting DCE surveys. In general, the overarching goal of most applied environmental DCE surveys is to provide welfare measures that mirror as accurately as possible the actual values of the target population. A DCE survey with the highest level of validity would be one that produces WTP estimates that are identical to the true WTPs in the population. However, given that true values cannot be observed for non-marketed environmental changes, such a direct and simple test of the validity of a DCE survey is not available. Instead, the validity of welfare measures obtained from a DCE survey will have to be assessed through more indirect indicators. Different classifications of validity testing can be found in the valuation literature as well as in the broader survey literature (e.g. Bateman et al. 2002; Scherpenzeel and Saris 1997). In a recent paper, Bishop and Boyle (2019) present an overview and a useful framework for considering validity as well as reliability of non-market valuation surveys. They outline three different aspects of validity, referred to as “the Three Cs”: content validity, construct validity and criterion validity. All three are important for assessing the validity of welfare estimates obtained from an environmental DCE survey.

Content validity concerns the extent to which the chosen valuation method, as well as all aspects of its practical implementation, is appropriate and conducive to obtaining a measure of the true value. This involves assessing to what extent all the various components of the DCE survey (e.g. questionnaire development, the questions asked, scenario descriptions, survey information, attributes included, survey mode, sampling of respondents, etc.) have induced respondents to make choices that are in line with their true preferences. Content validity assessment may also consider the extent to which the analysis of choices and reporting of results are conducted in a way that appropriately conveys valid welfare estimates to relevant end users, e.g. decision- and policymakers. The assessment of content validity is inherently subjective and relies largely on the analyst's common sense and accumulated experience and expertise.

Construct validity focuses more on the construct of interest, namely the value estimates and how their validity might be assessed in the absence of knowledge about the true values. One key element of construct validity is so-called expectation-based validity. Often the analyst will have some prior expectations of the values and how they relate to other variables. One source of such expectations is economic theory. According to economic theory, the marginal utility of income is positive, though decreasing with increasing income. This presents two theoretical expectations that can and should be tested in DCE surveys. Most importantly, the parameter estimate for the cost (price) attribute should be significantly negative, since paying money means giving up income, which according to the underlying economic theory implies a loss of utility. In other words, keeping everything else constant, increasing the cost of an alternative should decrease the probability of choosing that alternative. Estimating an insignificant or even a positive cost parameter would seriously invalidate the results of a DCE survey. Hence, this is probably the most crucial validity test that any DCE survey has to pass. An associated validity test concerns the decreasing marginal utility of income. Again, keeping everything else constant, this implies that a respondent with relatively low income should be more sensitive to the cost of an alternative than a respondent with relatively high income. If there is sufficient income variation in the respondent sample, this can be tested, for example, by incorporating interactions between the cost attribute and dummy variables for income brackets. The parameter estimate for such an interaction should be significant, and the sign would depend on which income bracket is described by the incorporated dummy. For other, non-cost attributes, there might also be expectations based on economic theory.
Of particular relevance is what may be considered an internal test of sensitivity to scope. The non-satiation axiom of consumer preferences means that more consumption is always better than less. Hence, if, for instance, people have positive preferences for the conservation of endangered species, and we use an attribute in a DCE to describe three different levels of species conservation (e.g. 10, 100 and 1000 species protected), one would expect \(WTP\left( {10} \right) < WTP\left( {100} \right) < WTP\left( {1000} \right)\), or at least \(WTP\left( {10} \right) \le WTP\left( {100} \right) \le WTP\left( {1000} \right)\).
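To make this internal scope test concrete, the following minimal sketch computes the marginal WTP values implied by a linear-in-parameters choice model and checks the weak monotonicity condition above. All coefficient values are hypothetical and purely illustrative:

```python
# Sketch of an internal scope test: check that implied WTP increases
# (weakly) with the number of species protected. In a linear-in-parameters
# model, marginal WTP for an attribute level is -beta_level / beta_cost.

def wtp(beta_attr, beta_cost):
    """Marginal WTP implied by a linear-in-parameters choice model."""
    return -beta_attr / beta_cost

# Hypothetical dummy-coded coefficients for 100 and 1000 species protected,
# relative to the 10-species baseline, and a (negative) cost coefficient.
beta_cost = -0.05
beta_100 = 0.40   # utility of protecting 100 vs. 10 species
beta_1000 = 0.90  # utility of protecting 1000 vs. 10 species

wtp_100 = wtp(beta_100, beta_cost)    # 8.0
wtp_1000 = wtp(beta_1000, beta_cost)  # 18.0

# Weak scope sensitivity relative to the baseline:
# WTP(10) = 0 <= WTP(100) <= WTP(1000)
assert 0 <= wtp_100 <= wtp_1000
```

In an actual application, the assertion would be replaced by a statistical test of whether the WTP differences are significantly different from zero.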

Another source of expectations for construct validity tests could be intuition or past research experience. For instance, given the plethora of water quality valuation studies concluding that people have positive WTP for improvements in water quality, one would expect to find a significantly positive parameter estimate for an attribute describing improvements in water quality. These types of validity tests should probably be considered less strict than those based on economic theory. There could be good reasons why a specific study might not replicate the findings of other studies, e.g. if the target population differs. This would obviously not be as serious as finding that the underlying economic theory assumptions were violated, but some good explanations would be warranted.

A somewhat different test of construct validity is the so-called convergent validity (see, e.g., Hoyos and Riera 2013). When previously conducted valuation studies, using DCE or other valuation methods, have been designed to estimate the same value as the DCE currently being conducted, the value estimates should be statistically similar. Thus, when discussing the results of a new DCE survey, it is common practice to compare WTP estimates to previous estimates of WTP for the same or a similar type of good, if available. If results are statistically indistinguishable, such a convergence of results may be interpreted as an indication of construct validity. This relates to the expectation-based validity mentioned above relying on past research experience. If one is conducting a new DCE investigating preferences for water quality improvements, the obtained WTP estimates should be compared to (some of) the many previous WTP estimates available in the literature. It is important to note, though, that even if WTP estimates are similar in two or more surveys, this does not guarantee that valid estimates of the true WTPs have been obtained. For instance, a DCE and a CVM survey might produce similar WTP estimates for a water quality improvement, but both may suffer from hypothetical bias. Furthermore, if WTP estimates are found to differ significantly from previous estimates, can we conclude that either the new or the previous WTP estimates are biased? Or are both biased, but to differing degrees? This underlines the importance of considering all three Cs.

The last of the three Cs refers to criterion validity. This idea is quite similar to that of convergent validity, namely comparing the WTP estimates obtained in a new DCE survey to previously obtained WTP estimates for the same good. The main difference is that the previous WTP estimates have been obtained with a method that is generally considered to provide highly valid estimates of true WTP. Ideally, this benchmark would be market prices, but these are obviously not available for non-marketed goods. A second-best solution is to look towards simulated markets or laboratory or field experiments involving actual economic transactions (see, e.g., Carlsson and Martinsson 2001; Murphy et al. 2005), which are commonly considered to be of higher validity than purely hypothetical settings. In practice, however, such experiments involving actual payments for non-marketed environmental goods are quite rare, simply because they are often not possible to construct. Hence, quite often when considering WTP for environmental goods, criterion validity is impossible to assess.

To sum up, it is generally recommended to thoroughly consider the three Cs of validity at all stages of environmental DCE surveys—from initial conceptualisation and survey design through to data collection and analysis. The purpose here is to ensure as far as possible that the estimated values will reflect the population’s actual values for the described environmental change. The three Cs are equally important in the final stage: reporting results to end users. The aim here is to demonstrate to end users that the generated value estimates are valid. For most environmental valuation contexts, it is not known what the true values are—and this is the motivation for conducting a DCE in the first place. Hence, end users’ validity assessments will inherently be quite subjective, based on whatever information is available to them. It is therefore recommended to report as much detail as possible about the background for the value estimates, thus essentially enabling end users to make their own assessments of the content validity. Carefully describe questionnaire development, data collection and analysis, and make sure results are discussed thoroughly in relation to previous findings as well as theory-based or case-specific expectations. If reporting results in scientific journals or other outlets with page or word limits, it is recommended to provide supplementary material online. This includes the full questionnaire used for data collection as well as summary reports from focus groups and pilot testing. The sampling strategy, a detailed analysis of representativeness and the econometric analysis may also be reported here in more detail than in the main report or paper. Though not yet standard practice, in the spirit of reproducibility it is also recommended to make the data as well as the code used to generate reported WTP estimates available to others in permanent and freely accessible repositories.

8.2 Testing Reliability

Reliability and validity determine the accuracy of estimates of welfare change derived using valuation methods. Both are often described with the metaphor of shooting arrows at a target, as in archery. Reliability implies that arrows are grouped closely together. This does not mean the arrows have hit the bullseye or are even close to it. On the contrary, reliability may be found even if arrows are consistently off target, as long as they are off in the same direction. Low reliability therefore means that repeated shots at the target are dispersed widely across the target. Validity, then, measures how close the arrows are to the bullseye. Therefore, in the words of Bishop and Boyle (2019, p. 560), “reliability is about variance and validity is about bias”. Relating this metaphor to choice experiments, a shot at the target is like conducting a survey to generate an estimate of welfare change (say, of average WTP for an environmental improvement). It should be emphasised that this section is concerned with the reliability of a method, choice experiments, for obtaining such a welfare estimate, and not with the reliability of the welfare measure (e.g. compensating variation) per se.

Typical valuation studies only permit a single “shot” at the target, i.e. a single survey. In this case, nothing can be said about the reliability of choice experiments as a method to derive estimates of welfare change. The main procedure used in social science to assess the reliability of survey-based measurements is to conduct a test-retest study (Yu 2005; Liebe et al. 2012). Ideally, the same subjects conduct the same task, e.g. responding to a survey or participating in an experiment, at two (or more) points in time, and provide independent observations. Therefore, instead of having only one shot at the target, now there are two (or more). Statistical tests can then be used to test the hypothesis of equality with respect to measures or indicators that the tested methods are supposed to provide.

In the context of choice experiments, a test-retest experiment implies conducting the same survey again, at different points in time, i.e. conducting several survey waves at points in time \(t, t + 1, \ldots , t + n\) where \(n\) defines the time lag between survey waves. This can be done with the same subjects (within-subject test-retest), who ideally then answer exactly the same choice sets; or, if within-subject tests are not possible, the retest is undertaken with a different sample from the same population (between-subject test-retest). Within-subject tests are considered advantageous over between-subject tests, although there are challenges regarding the assumption that observations at two (or more) points in time are indeed independent. We will discuss this further below. Moreover, there is an implicit (i.e. not typically tested or controlled for) assumption that respondents answer the surveys at different points in time under the same circumstances. However, this may or may not be the case. Given that most test-retest studies are conducted using a web survey mode, let us consider a web survey. A respondent may complete it in a busy workplace environment on a desktop PC in the first wave and on a mobile phone while relaxing on the sofa at home at the weekend in the second wave.

As of now, test-retest studies in the choice experiment literature consider some or all of the measures of test-retest reliability listed below. It should be noted that these also apply to comparisons of choice consistency within the same survey wave (e.g. Brouwer et al. 2017; Czajkowski et al. 2016), which is, however, not the focus of this section.

  (a) Tests of congruence (or consistency) of choices for data collected in different survey waves. Such tests can concern congruence of chosen alternatives across the whole sample, within blocks of the experimental design or for individual choice sets, as well as the number of congruent choices that each individual respondent made.

  (b) Tests of equality of parameter vectors and, if equality of parameter vectors cannot be rejected, equality of error variance between survey waves.

  (c) Tests of equality of mean WTP (or WTP distributions) between survey waves.

Different statistical tests are used to assess the above dimensions of test-retest reliability. For example, Brouwer et al. (2017) use a sign test for equality of choices, while Liebe et al. (2012) use a test of symmetry of test and retest choices proposed by Bowker (1948). Mørkbak and Olsen (2015) and Matthews et al. (2017) test general agreement of choices, taking into account that respondents may choose the same alternative in two waves by chance, using a correction factor for random matching, Cohen’s κ (Cohen 1968). Mørkbak and Olsen (2015), Rigby et al. (2016) and Brouwer et al. (2017) use a parametric approach to explain choice consistency using panel data probit models.
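To illustrate the correction for random matching, the following sketch computes Cohen’s κ for two waves of choices in pure Python. The choice data are hypothetical and serve only to show the mechanics of the measure:

```python
from collections import Counter

def cohens_kappa(choices_t1, choices_t2):
    """Cohen's kappa: agreement between two waves of choices,
    corrected for the agreement expected under random matching."""
    n = len(choices_t1)
    observed = sum(a == b for a, b in zip(choices_t1, choices_t2)) / n
    # Expected agreement if waves were independent, from marginal shares
    p1, p2 = Counter(choices_t1), Counter(choices_t2)
    expected = sum((p1[k] / n) * (p2[k] / n) for k in p1)
    return (observed - expected) / (1 - expected)

# Hypothetical choices (alternatives A/B/C) of 8 respondents in two waves
wave1 = ["A", "B", "A", "C", "B", "A", "C", "B"]
wave2 = ["A", "B", "C", "C", "B", "A", "A", "B"]
kappa = cohens_kappa(wave1, wave2)  # observed = 0.75, expected = 0.34375
```

A κ of 1 indicates perfect congruence, while a κ near 0 indicates no more agreement than expected by chance.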

Equality of parameter vectors across survey waves is typically tested following the Swait and Louviere (1993) procedure. Comparisons of mean WTP estimates across surveys conducted at different points in time can be based on a Krinsky and Robb (1986, 1991) procedure followed by the complete combinatorial test proposed by Poe et al. (2005), or Wald tests in case estimates are derived from models in WTP space rather than preference space models (Czajkowski et al. 2016; Brouwer et al. 2017). In addition to testing for equality of mean WTP distributions, Czajkowski et al. (2016) test for equality of variances in derived WTP distributions.
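The complete combinatorial test can be sketched in a few lines. Below, the WTP draws are simulated from normal distributions purely for illustration; in an actual application they would be generated with the Krinsky and Robb procedure from the estimated coefficients and their covariance matrix:

```python
import random

def poe_test(draws_a, draws_b):
    """Poe et al. (2005) complete combinatorial test: compute the
    proportion of all pairwise differences (a - b) that are <= 0 and
    return a two-sided p-value for H0: equal WTP distributions."""
    n_leq = sum(1 for a in draws_a for b in draws_b if a - b <= 0)
    gamma = n_leq / (len(draws_a) * len(draws_b))
    return 2 * min(gamma, 1 - gamma)

random.seed(42)
# Stand-ins for Krinsky-Robb draws of mean WTP from two survey waves
wtp_wave1 = [random.gauss(10.0, 2.0) for _ in range(500)]
wtp_wave2 = [random.gauss(10.5, 2.0) for _ in range(500)]
p_value = poe_test(wtp_wave1, wtp_wave2)
```

A small p-value indicates that the two simulated WTP distributions differ significantly across survey waves.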

Findings across the half dozen or so applications of test-retest reliability in choice experiments in the environmental economics literature provide a mixed picture that, however, tends to suggest that choice experiments can provide reliable welfare estimates. Additional test-retest choice experiment studies can be found in the health economics literature, for example Bryan et al. (2000) and San Miguel et al. (2002). The following summary focuses on differences in WTP estimates across survey waves only. In a between-subject test-retest study with a time lag between survey waves of one year, Bliem et al. (2012) report no significant differences in WTP. Liebe et al. (2012) provide a within-subject test-retest study with survey waves eleven months apart and find significant differences in WTP for only one attribute level. Czajkowski et al. (2016) find that the means of WTP distributions are relatively stable over time (lag of six months), while the variances are found to differ. They argue, however, that accounting for preference heterogeneity and correlations of random parameters is a more stringent test of preference equality across time periods. Comparing results of surveys conducted with a time lag of one year between survey waves, Schaafsma et al. (2014) report that mean WTP estimates were not significantly different at the 5% level, but that estimates of compensating variation for policy scenarios can differ significantly. Rigby et al. (2016) and Mørkbak and Olsen (2015) find a relatively high degree of inter-temporal preference stability and choice consistency for \(t + 1\) = 6 months and \(t + 1\) = 2 weeks, respectively. Matthews et al. (2017) and Brouwer et al. (2017) each compare results for three survey waves, with waves three months apart (Matthews et al. 2017) and six as well as 24 months after the initial survey wave (Brouwer et al. 2017). Both Matthews et al. (2017) and Brouwer et al. (2017) report significant differences in WTP estimates across survey waves. Brouwer et al. (2017) also provide a comparison of test-retest reliability for choice experiment and open-ended SP question formats.

What may drive potential differences in choice consistency and preferences across time? An intuitive suspicion is that there were changes in the composition and/or characteristics of the sample which influenced preferences; for example, income or education may have changed over time, or a participant may have become more or less environmentally concerned. Therefore, all test-retest studies need to carefully control for potential differences in sample characteristics and/or composition over time. This highlights some of the trade-offs in choosing the time interval between survey waves. If the interval is very short, we can be reasonably confident that characteristics such as income or general environmental concern will not have changed between time periods. The longer the interval, the greater the likelihood that such factors have changed, and the greater the chance that unobserved, and thus uncontrolled, factors affecting preferences contaminate the test-retest “experiment”. However, shorter intervals between survey waves make it more likely that respondents remember their answers to the previous survey, or are influenced by their previous experience with the same survey. This would then call into question the independence of observations obtained from the same respondents across survey waves. Generally, test-retest studies may be subject to effects resulting from preference learning, as analysed in Plott (1993) and, in the context of unfamiliar public goods, in Brouwer (2012). This learning effect, which is most likely context-dependent, may or may not be invariant to the time lag between survey waves.

Another aspect related to the independence of observations across survey waves is experience in responding to choice experiment surveys and the associated institutional learning. That is, respondents may learn how to evaluate choice alternatives and the associated attribute trade-offs. However, institutional learning should theoretically only affect error variance, not preference parameters. Respondents’ engagement with a survey may also be affected if they realise that they are being asked to respond to exactly the same survey again. Again, one could argue that this may primarily affect choice consistency, and hence error variance, rather than preferences. However, it is conceivable that repeating the survey may affect perceived consequentiality, which in turn may affect WTP. This will depend on how the repeat survey is introduced to respondents.

A number of studies have investigated factors influencing the likelihood of choice consistency in terms of congruence of choices facing the same choice tasks across survey waves (e.g. Mørkbak and Olsen 2015; Rigby et al. 2016; Brouwer et al. 2017). Aspects that were found to matter include choice complexity (e.g. using the entropy measure of complexity suggested by Swait and Adamowicz 2001), response times, cognitive capability of respondents, respondents’ experience with a good and measures of respondents’ stated certainty regarding choices.

It is important to note that the null hypothesis of equality between survey rounds in test-retest experiments is less likely to be rejected if the variance of the variables and parameter estimates that serve as measures of reliability increases. This variance is a function of sample size and other factors. Therefore, all else being equal, studies with larger samples are implicitly less likely to confirm test-retest reliability using the statistical tests mentioned above, even though their internal estimates (i.e. estimates for each survey round) are actually more reliable. Other factors affecting variance include, for example, whether the information provided in the valuation scenario is clear and can be understood in a similar way by all survey respondents. In this way, a survey that contains confusing information on valuation scenarios (e.g. attribute and attribute level descriptions) is actually more likely to statistically confirm test-retest reliability than a survey where this information is provided in a clear and concise manner that has been thoroughly pretested for understanding using qualitative methods such as focus groups and “think aloud” protocols.

Given the multitude of potentially relevant factors influencing choices, and thus choice consistency over time, it is clearly quite challenging to infer general statements about the reliability of the choice experiment method for obtaining welfare estimates from a single test-retest experiment (Bishop and Boyle 2019). This may ultimately change as more test-retest studies of choice experiments become available.

8.3 Comparing Models

An important step when modelling discrete choices is selecting which models to present in, for example, a journal paper. Authors usually present only a few models, although they may have estimated 20 or 30 different specifications. There are several ways to compare models, yet it is difficult to come up with a straightforward and unambiguous model choice (see, e.g., Sagebiel 2017 for a review of methods for choosing between an RP-MXL and an LCM). The data generating process is unknown and all efforts to identify the “true” model are—to some extent—speculative. In many cases, researchers base their decisions on statistical measures of fit and test results, and argue that the presented models are those that seem to fit the data best. However, model choice can also be based on the research question and the specific goals of the study. For example, if the sample size is small and the research interest is in preference heterogeneity, it may be pragmatic to go for a parsimonious RP-MXL or LCM rather than a highly parameterised LCRP-MXL model with error components (Train 2009). If the focus is on prediction, it may be a good idea to estimate a model with many parameters, some of which have no theoretical or behavioural underpinning. Note, however, that in most applications in environmental economics the focus is not on prediction. If the focus is on testing a theoretically derived hypothesis, a parsimonious model can be a better choice, as it is less prone to overfitting and multicollinearity. In short, choosing a model is ultimately based on the researcher’s own judgement, which is informed by several, sometimes contrasting, criteria and the purpose of the research. As George Box put it, “all models are wrong, but some are useful” (Box 1979, p. 202). Hence, the researcher’s task is to select the most useful model for a given dataset, purpose and context.

There are two main strategies to compare models statistically. The first strategy is based on the estimated log-likelihood values and gives information on how well a model explains the observed data (i.e. the data used to estimate the model). However, it does not tell us how well the model explains/predicts choices. Basing model choice only on model fit bears the risk of overfitting a model. An overfitted model explains the observed data very well—but only the observed data. An overfitted model applied to a new data set likely performs worse than a more parsimonious model. The broad term cross validation describes a set of methods to identify how the estimated model performs in predicting out-of-sample choices.

Whichever route a researcher chooses, a first step should always be a visual inspection of the models. Are the parameters plausible? Do the models provide reasonable welfare estimates and meaningful distributions of WTP? By just looking at the model output, it may be possible to quickly detect inconsistencies in certain models.

8.3.1 Model Fit-Based Strategies to Choose Among Different Models

The easiest and quickest way to compare models is by looking at the log-likelihood value, the pseudo-\({R}^{2}\) and information criteria such as the Bayesian Information Criterion (BIC) and the Akaike Information Criterion (AIC). In addition, statistical tests can be used to determine whether the log-likelihood value of one model is statistically significantly larger than that of another.

Goodness-of-fit measures are used for a general description of how well the model fits the data. The most widely used measure of the goodness-of-fit of discrete choice models is McFadden’s pseudo-R2, defined as

$$\textit{McFadden pseudo-}R^{2} = 1 - \frac{\ln L}{{\ln L_{0} }},$$

where \(\ln L\) is the log-likelihood value at convergence and \(\ln L_{0}\) is the log-likelihood value of the model including only alternative-specific constants for all alternatives but one. Since it is always in the range \(\left[ {0,1} \right]\) and higher values represent a better fit, it is somewhat similar to the R2 statistic from linear models; note, however, that the values of McFadden’s pseudo-R2 do not have a direct interpretation. The value of McFadden’s pseudo-R2 is therefore largely meaningless in itself, and it is unknown whether, for example, 0.2 represents a “good” or “poor” model fit (Greene 2017). In recent years, more appealing alternatives have been proposed, such as Tjur’s pseudo-R2 (Tjur 2009).
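As a quick illustration, the measure is computed directly from the two log-likelihood values; the numbers below are hypothetical:

```python
def mcfadden_pseudo_r2(lnL, lnL0):
    """McFadden's pseudo-R2 from the log-likelihood at convergence
    (lnL) and that of the constants-only model (lnL0)."""
    return 1.0 - lnL / lnL0

# Hypothetical log-likelihood values for a fitted model and the
# constants-only baseline
r2 = mcfadden_pseudo_r2(-800.0, -1000.0)  # 0.2
```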

A related approach for assessing the fit of the model and for comparing competing models is based on measures of information. In this regard, the information theory-founded AIC is commonly used:

$$AIC = - 2\ln L + 2 K,$$

where \(K\) is the number of parameters in the model. An alternative measure is the Bayesian (Schwarz) Information Criterion

$$BIC = - 2\ln L + K\ln N,$$

where \(N\) is the number of observations; the BIC thus imposes a higher penalty per parameter than the AIC (\(\ln N\) rather than 2) whenever \(N > 7\). Note that these measures are not limited to the 0–1 range and lower values represent a better model fit. It is typical to report normalised AIC and BIC values, that is, divided by the number of observations. It is worth noting that although these goodness-of-fit measures can be compared between models to describe which model fits better, they do not indicate whether the improvement in model fit is statistically significant.
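Both criteria are straightforward to compute. The sketch below compares two hypothetical models; all log-likelihood values and parameter counts are invented for illustration:

```python
import math

def aic(lnL, K):
    """Akaike Information Criterion: -2 lnL + 2K."""
    return -2.0 * lnL + 2.0 * K

def bic(lnL, K, N):
    """Bayesian (Schwarz) Information Criterion: -2 lnL + K ln(N)."""
    return -2.0 * lnL + K * math.log(N)

# Hypothetical comparison: conditional logit (K = 5) vs. RP-MXL (K = 10),
# both estimated on N = 1000 observations
aic_cl, bic_cl = aic(-520.0, 5), bic(-520.0, 5, 1000)
aic_mxl, bic_mxl = aic(-500.0, 10), bic(-500.0, 10, 1000)
# Lower values indicate a better fit; the BIC penalises the RP-MXL's
# extra parameters more heavily than the AIC does.
```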

There are several tests that can be used to compare model fit. The likelihood ratio test can be used to compare nested models. For example, it is possible to test a conditional logit model against an RP-MXL, but it is not possible to test RP-MXL models against LCMs (as these are non-nested). A rarely used test for comparing non-nested models has been proposed by Ben-Akiva and Swait (1986). This test is based on the AIC and provides an upper bound on the probability that the model with the lower goodness of fit (an arbitrarily labelled Model 2) is nonetheless the true model. An application of the test is provided, for example, in Sagebiel (2017).
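For nested models, the likelihood ratio statistic is easily computed by hand. The sketch below compares a hypothetical conditional logit to an RP-MXL that adds five standard-deviation parameters; all values are invented, and the boundary issue that arises when testing standard deviations of zero is ignored in this simple illustration:

```python
def likelihood_ratio(lnL_restricted, lnL_unrestricted):
    """LR statistic for nested models; asymptotically chi-squared
    with degrees of freedom equal to the number of restrictions."""
    return 2.0 * (lnL_unrestricted - lnL_restricted)

# Hypothetical: conditional logit (restricted) vs. RP-MXL adding
# 5 standard-deviation parameters (unrestricted)
lr = likelihood_ratio(-520.0, -500.0)  # 40.0
chi2_crit_5pct_df5 = 11.07  # tabulated 5% critical value, df = 5
reject_restricted = lr > chi2_crit_5pct_df5  # True: prefer the RP-MXL
```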

8.3.2 Cross Validation

Cross validation, in general, refers to validating a model by applying it to data that were not used in estimation. One simple strategy is to delete one observation from the sample, estimate the model and see how well the model predicts the left-out observation. This exercise is repeated for each observation and the average prediction error is calculated. The key advantage of this “leave-one-out” approach is that it provides a very accurate estimate of the model's out-of-sample performance, as almost all of the data are used for estimation in each run. The disadvantage is that it is computationally intensive, as the same model has to be re-estimated many times. It is therefore most appropriate for smaller samples and simple models. An alternative strategy is to randomly set aside a certain percentage of the observations (a hold-out sample) and estimate the model without them. The estimated parameters are then used to predict the choices of the excluded observations, and the number of correct predictions serves as an indication of how well the model performs out of sample. This procedure is less computationally intensive than the leave-one-out approach, because the model is estimated only once. The hold-out sample approach is therefore more suitable for larger samples and computationally intensive models. Although cross validation is less frequently used in environmental DCE applications, it is a very powerful way to investigate a model's predictive ability and to identify overfitted models (Bierlaire 2016).
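The hold-out procedure can be sketched as follows. To keep the example self-contained, the "model" is deliberately naive: it simply predicts the most frequent alternative in the estimation sample, whereas a real application would estimate a choice model and predict from its parameters. The data are synthetic:

```python
import random

random.seed(1)

# Synthetic choices among three alternatives, with alternative "B"
# chosen most often; a stand-in for real DCE data.
choices = random.choices(["A", "B", "C"], weights=[2, 5, 3], k=500)

# Randomly split into an estimation sample and a 20% hold-out sample
idx = list(range(len(choices)))
random.shuffle(idx)
cut = int(0.8 * len(idx))
estimation = [choices[i] for i in idx[:cut]]
holdout = [choices[i] for i in idx[cut:]]

# "Estimate" the naive model: the modal alternative in the
# estimation sample. Replace this step with an actual choice model.
predicted = max(set(estimation), key=estimation.count)

# Out-of-sample hit rate on the hold-out observations
hit_rate = sum(c == predicted for c in holdout) / len(holdout)
```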

Choosing the correct model is a difficult task and requires researchers to inspect model results from different perspectives. While the purpose of the research can guide model choice, statistical criteria should always be taken into account and reported. Likelihood-based measures and tests as well as cross validation are useful tools. However, no selection criterion identifies “the correct” model. In the end, it is down to the researcher’s own judgement to select a model.

8.4 Prediction

Generally, a researcher does not have enough information to accurately predict an individual’s choice. Therefore, choice models can only predict the probability that an individual will choose an alternative, not the choice itself. The percentage of individuals in the sample for which the highest-probability alternative and the chosen alternative coincide is called the per cent correctly predicted or, simply, the hit rate.
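For concreteness, the hit rate is computed as follows; the predicted probabilities and observed choices below are hypothetical:

```python
def hit_rate(probabilities, chosen):
    """Share of observations whose chosen alternative also has the
    highest predicted probability."""
    hits = 0
    for probs, choice in zip(probabilities, chosen):
        # index of the alternative with the highest predicted probability
        if max(range(len(probs)), key=probs.__getitem__) == choice:
            hits += 1
    return hits / len(chosen)

# Hypothetical predicted probabilities for 4 respondents, 3 alternatives
probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
    [0.5, 0.3, 0.2],
]
observed = [0, 2, 2, 1]  # index of the chosen alternative
rate = hit_rate(probs, observed)  # 0.5
```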

It is important to bear in mind that predicting choice probabilities means that if the choice situation were repeated numerous times, each alternative would be chosen a certain proportion of the time. It does not mean that the alternative with the highest probability will be chosen every time; an individual may choose an alternative with a low predicted probability on a specific choice occasion because of factors not observed by the researcher. This is why the widely used goodness-of-fit measure based on the “per cent correctly predicted” should be avoided: it is at odds with the concept of probability, as it assumes that the choice is perfectly predicted by picking the alternative for which the model gives the highest probability.

In some fields, such as transportation or marketing, the common approach to forecasting is to estimate the best possible model and use it to predict individual choices, which are then aggregated into the quantity of interest. Nevertheless, seeking an excellent in-sample fit can lead to an overfitted model that offers little confidence in terms of out-of-sample forecasting ability. In environmental valuation, the main focus is usually on WTP values or welfare measures based on the estimated coefficients, not on predicted choices. Notwithstanding this, if the alternatives are assigned to a specific environmental programme or action, individual predictions can be relevant for identifying appropriate policies.

The literature mentions various problems related to predicting the probabilities of choosing an alternative, including, for example, uncertainty about future alternatives, aggregation and the aforementioned overfitting. The aggregation problem can appear across individuals, alternatives or time. Discrete choice models are usually estimated at the level of individual decision-makers (allowing for heterogeneity and interdependencies among individuals), but the predicted quantity is aggregated (e.g. a market share or the average response to a policy change). The consistent way of aggregating over individuals is sample enumeration (Train 2009). Finding a trade-off between the best model fit and the highest predictive performance is a relatively difficult task. A comprehensive description of the problems related to prediction in choice models can be found in Habibi (2016).
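Sample enumeration can be sketched in a few lines: predicted market shares are obtained by averaging each alternative's predicted choice probability over the sampled individuals. The probabilities below are hypothetical; in practice they come from a model estimated at the individual level:

```python
# Hypothetical predicted choice probabilities for four respondents
# over three alternatives (each row sums to one).
individual_probs = [
    [0.5, 0.3, 0.2],   # respondent 1: P(alt 1), P(alt 2), P(alt 3)
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
    [0.4, 0.4, 0.2],
]

# Sample enumeration: the aggregate share of each alternative is the
# average of the individual predicted probabilities.
n = len(individual_probs)
predicted_shares = [
    sum(p[j] for p in individual_probs) / n
    for j in range(3)
]
# predicted_shares == [0.35, 0.40, 0.25], and the shares sum to one
```

With survey weights, each individual's probabilities would be weighted accordingly before averaging.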