Accurately measuring willingness to pay for consumer goods: a meta-analysis of the hypothetical bias

Consumers’ willingness to pay (WTP) is highly relevant to managers and academics, and the various direct and indirect methods used to measure it vary in their accuracy, defined as how closely the hypothetically measured WTP (HWTP) matches consumers’ real WTP (RWTP). The difference between HWTP and RWTP is the “hypothetical bias.” A prevalent assumption in marketing science is that indirect methods measure WTP more accurately than do direct methods. With a meta-analysis of 77 studies reported in 47 papers and resulting in 115 effect sizes, we test that assumption by assessing the hypothetical bias. The total sample consists of 24,347 included observations for HWTP and 20,656 for RWTP. Moving beyond extant meta-analyses in marketing, we introduce an effect size metric (i.e., response ratio) and a novel analysis method (i.e., multivariate mixed linear model) to analyze the stochastically dependent effect sizes. Our findings are relevant for academic researchers and managers. First, on average, the hypothetical bias is 21%, and this study provides a reference point for the expected magnitude of the hypothetical bias. Second, the deviation primarily depends on the use of a direct or indirect method for measuring HWTP. In contrast with conventional wisdom, indirect methods actually overestimate RWTP significantly stronger than direct methods. Third, the hypothetical bias is greater for higher valued products, specialty goods (cf. other product types), and within-subject designs (cf. between-subject designs), thus a stronger downward adjustment of HWTP values is necessary to reflect consumers’ RWTP.


Introduction
In a state-of-practice study of consumer value assessments, Anderson et al. (1992, p. 3) point out that consumers' willingness to pay (WTP) is Bthe cornerstone of marketing strategy^that drives important marketing decisions. First, consumers' WTP is the central input for price response models that inform optimal pricing and promotion decisions. Second, a new product's introductory price must be carefully chosen, because a poorly considered introductory price can jeopardize the investments in its development and threaten innovation failures (Ingenbleek et al. 2013). Not only do companies need to know what consumers are willing to pay early in their product development process, but WTP is also of interest to researchers in marketing and economics who seek to quantify concepts such as a product's value (Steiner et al. 2016). Obtaining accurate measures of consumers' WTP thus is essential.
Existing methods for measuring WTP can be assigned to a 2 × 2 classification (Miller et al. 2011), according to whether they measure WTP in a hypothetical or real context, with direct or indirect measurement methods (see Table 1). First, a hypothetical measure of WTP (HWTP) does not impose any financial consequences for participants' decisions.
Participants just state what they would pay for a product, if Mark Houston and John Hulland served as Special Issue Editors for this article.
given the opportunity to buy it. In contrast, participants may be required to pay their stated WTP in a real context, which provides a real measure of WTP (RWTP). This could for example be in the context of an auction, where the winner in the end actually has to buy the product. The difference between RWTP and HWTP is induced by the hypothetical context and is called Bhypothetical bias.^This hypothetical bias provides a measure of the hypothetical method's accuracy (Harrison and Rutström 2008). In case HWTP is measured with two different methods, the one with the lower hypothetical bias gives a more accurate estimate of participants' RWTP, increasing the estimate's validity. We conceptualize the hypothetical bias as the ratio of HWTP to RWTP. A method yielding an exemplary hypothetical bias of 1.5 shows that those participants overstate their RWTP for a product by 50% when asked hypothetically. Second, direct methods ask consumers directly for their WTP, whereas indirect methods require consumers to evaluate, compare, and choose among different product alternatives, and the price attribute is just one of several characteristics. Then, WTP can be derived from their responses.
Many researchers assume that direct methods create a stronger hypothetical bias, because they evoke greater price consciousness (Völckner 2006). In their pricing textbook, Nagle and Müller (2018) allege that direct questioning Bshould never be accepted as a valid methodology. The results of such studies are at best useless and are potentially highly misleading^(p. 186). Simon (2018) takes a similar line, stating, BIt doesn't make sense to ask consumers directly for the utility or their WTP, as they aren't able to give a direct and precise estimate. The most important method to quantify utilities and WTP is the conjoint analysis^(p. 53). Because indirect methods represent a shopping experience, they are expected to be more accurate for measuring HWTP (Breidert et al. 2006;Leigh et al. 1984;Völckner 2006). Still, practitioners largely continue to rely on direct survey methods, which tend to be easier to implement (Anderson et al. 1992;Hofstetter et al. 2013;Steiner and Hendus 2012).
Various studies specify the accuracy of one or more direct or indirect methods by comparing HWTP with RWTP. Yet no clear summary of these findings is available, 1 and considering the discrepancy between theory and practice, Bthere is a lack of consensus on the 'right' way to measure […] consumer's reservation price^ (Wang et al. 2007, p. 200). Therefore, with this study we seek to shed new light on the relative accuracy of alternative methods for measuring consumers' WTP, and particularly the accuracy of direct versus indirect methods. We perform a meta-analysis of existing studies that measure HWTP and RWTP for the same product or service, which reveals some empirical generalizations regarding accuracy. We also acknowledge the potential influence of other factors on the accuracy of WTP measures (Hofstetter et al. 2013;Sichtmann et al. 2011), such that we anticipate substantial heterogeneity across extant studies. With a meta-regression, we accordingly identify moderators that might explain this heterogeneity in WTP accuracy (Thompson and Sharp 1999;van Houwelingen et al. 2002). Our multivariate mixed linear model enables us to analyze the stochastically dependent effect sizes (ESs) explicitly (Gleser and Olkin 2009;Kalaian and Raudenbush 1996), which provides the most accurate way to deal with dependent ESs (van den Noortgate et al. 2013). As an effect size (ES) measure, we use the response ratio of HWTP and RWTP (Hedges et al. 1999), such that we obtain the relative deviation of HWTP. To the best of our knowledge, no previous meta-analysis in marketing has applied a mixed linear model nor a response ratio to measure ESs.
On average, the hypothetical bias is about 21%. In addition, direct methods outperform indirect methods with regard to their accuracy. The meta-regression shows that, compared with direct measurement methods, the hypothetical bias is considerably higher in indirect measures, by 10 percentage  (Carson et al. 1996;List and Gallet 2001;Murphy et al. 2005). However, they focus on public goods and their results are of limited use for marketing. In contrast to the existing meta-analyses, we focus on private goods and include several private good specific moderators of high interest for marketers. For a more detailed discussion of the three existing meta-analyses, please refer to Web Appendix A.
points in a full model. This finding contradicts the prevailing wisdom in academic studies but supports current practices in companies. In addition to the type of measurement, value of the product, product type, and type of subject design have a significant influence on the hypothetical bias.
In the next section, we prove an overview of WTP and its different measurement options. After detailing the data collection and coding, we explicate our proposed ES measure, which informs the analysis approach we take to deal with stochastically dependent ESs. We present the results and affirm their robustness with multiple methods. Finally, we conclude by highlighting our theoretical contributions, explaining the main managerial implications, and outlining some limitations and directions for further research.

Definition and classification
We take a standard economic view of WTP (or reservation price) and define it as the maximum price a consumer is willing to pay for a given quantity of a product or a service (Wertenbroch and Skiera 2002). At that price, the consumer is indifferent to buying or not buying, because WTP reflects the product's inherent value in monetary terms. That is, the product and the money have the same value, so spending to obtain a product is the same as keeping the money.

Hypothetical versus real WTP
The first dimension in Table 1 distinguishes between hypothetical and real contexts, according to whether the measure includes a payment obligation or not. Most measures of RWTP rely on incentive-compatible methods, which ensure it is the participant's best option to reveal his or her true WTP. Several different incentive-compatible methods are available (Noussair et al. 2004) and have been used in prior empirical studies to measure RWTP. However, all methods that measure RWTP require a finished, sellable version of the product. Therefore, practitioners regularly turn to HWTP during the product development process, before the final product actually exists. In addition, measuring RWTP can be difficult and expensive, for both practitioners and researchers. Therefore, the accuracy of HWTP methods is of interest to practitioners and academics alike. Because RWTP reflects consumers' actual valuation of a product, it provides a clear benchmark for comparison with HWTP. We integrate existing empirical evidence about the accuracy of various direct and indirect methods to measure HWTP.

Direct measures usually include open questions, such as,
BWhat is the maximum you would pay for this product?Ô ther methods use closed question formats (Völckner 2006) and require participants to state whether they would accept certain prices or not. Still others combine closed and open questions. The choice bracketing procedure starts with several closed questions, each of which depends on the previous answer. If consumers do not accept the last price of the last closed question, they must answer an open question about how much they would be willing to pay (Wertenbroch and Skiera 2002).
In particular, the most widely used direct measures of RWTP are the Vickrey auction (Vickrey 1961) and the Becker-DeGroot-Marschak lottery (BDM) (Becker et al. 1964). In a Vickrey auction, every participant hands in one sealed bid. The highest bidder wins the auction but pays only the price of the second highest bid; accordingly, these auctions also are called second-price sealed bid auctions. By disentangling the bid and the potential price, no bidding strategy is superior to bidding actual WTP. Different adaptions of these Vickrey auctions are available, such as the random nth price auction (Shogren et al. 2001), in which participants do not know the quantity being sold in the auction upfront. In contrast, a BDM lottery does not require participants to compete for the product. Instead, participants first state their WTP, and then a price is drawn randomly. If her or his stated WTP is equal to or more than the drawn price, a participant must buy the product for the drawn price. If the stated WTP is less than the drawn price, she or he may not buy the product. Similar to the Vickrey auction, the stated WTP does not influence the drawn price and therefore does not determine the final price. Again then, the dominant strategy is to state actual WTP.
Not all direct measures of RWTP are theoretically incentive compatible. For example, in an English auction, the price increases until only one interested buyer is left, who eventually buys the product for the highest announced bid. Every bidder has an incentive to bid up WTP (Rutström 1998), so an English auction reveals all bidders' WTP, except for the winner's, who stops bidding after the last competitor leaves. Therefore, the English auction is not theoretically incentive compatible, yet the mean RWTP obtained tend to be similar to those resulting from incentive-compatible methods (Kagel et al. 1987). Therefore, we treat studies using an English auction as direct measures of RWTP.
Finally, the online auction platform eBay can provide a direct measure of RWTP. Unlike a Vickrey auction, the auction format implemented in eBay allows participants to bid multiple times, and the auction has a fixed endpoint. Although multiple bids from one participant imply that not every bid reveals true WTP, the highest and latest bid does provide this information (Ockenfels and Roth 2006). Theoretically then, eBay auctions are not incentive compatible either (Barrot et al. 2010), but the empirical results from eBay and Vickrey auctions are highly comparable (Ariely et al. 2005;Bolton and Ockenfels 2014). Schlag (2008) gauges RWTP from eBay by exclusively using the highest bid from each participant but disregarding the winners' bid. We include this study in our meta-analysis as an example of a direct method.

Indirect methods to measure WTP
Among the variety of indirect methods to compute WTP (Lusk and Schroeder 2004), the most prominent is choicebased conjoint (CBC) analysis. Each participant chooses several times among multiple alternative products, including a Bno choice^option that indicates the participant does not like any of the offered products. Each product features several product attributes, and each attribute offers various levels. To measure WTP, price must be one of the attributes. From the collected choices, it is possible to compute individual utilities for each presented attribute level and, by interpolation, each intermediate value. Ultimately, WTP can be derived according to the following relationship (Kohli and Mahajan 1991), which is the most often used approach in the studies included in the meta-analysis: where u it | − p is the utility of product t excluding the utility of the price, and u i (p) is the utility for a price level p for consumer i. In accordance with Miller et al. (2011) and Jedidi and Zhang (2002), we define u * i as the utility of the Bno choice^option. The resulting WTP indicates the highest price p that still fulfills the relationship. In their web appendix, Miller et al. (2011) provide a numerical example.
In principle, indirect methods provide measures of HWTP, because the choices and other judgments expressed by the participants do not have any financial consequences. Efforts to measure RWTP indirectly attempt to insert a downstream mechanism that introduces a binding element (Wlömert and Eggers 2016). For example, Ding et al. (2005) propose to randomly choose one of the selected alternatives and make that choice binding. Every choice could be the binding one, so participants have an incentive to reveal their true preferences throughout the task. Ding (2007) also incorporates the idea of the BDM lottery, proposing that participants could take part in a conjoint task, from which it is possible to infer their WTP for one specific product, according to the person's choices in the conjoint task. The inferred WTP then enters the BDM lottery subsequently, so participants have an incentive to reveal their true preferences in the conjoint task.

Hypotheses
We predict that several moderators may affect the hypothetical bias. In addition, we control for several variables. The potential moderators constitute four main categories: (1) methods for measuring WTP, (2) research stimulus, (3) general research design of the study, and (4) the publication in which the study appeared. The last category only contains control variables.

Moderators: HWTP measurement
Direct methods for measuring HWTP have some theoretical drawbacks compared to indirect methods. First, asking consumers directly for their HWTP tends to prime them to focus on the price (Breidert et al. 2006), which is unlike a natural shopping experience in which consumers choose among several products that vary on multiple attributes. That is, direct methods may cause atypically high price consciousness (Völckner 2006). Indirect methods address this drawback by forcing participants to weigh the costs and benefits of different alternatives. Second, when asked directly, consumers might try to answer strategically if they suspect their answers might influence future retail prices (Jedidi and Jagpal 2009). Because indirect methods do not prompt participants to state their HWTP directly, strategic answering may be less likely. Third, direct statements of HWTP are cognitively challenging, whereas methods that mimic realistic shopping experiences require less cognitive effort (Brown et al. 1996).
Indirect methods for measuring HWTP also have some drawbacks that might influence the hypothetical bias. First, researchers using a CBC must take care to avoid a numberof-levels effect, especially in pricing studies (Eggers and Sattler 2009). To do so, they generally can test only a few different prices, which might decrease accuracy if the limitation excludes the HWTP of people with higher (lower) WTP than the highest (lowest) price shown. Second, indirect methods assume a linear relationship between price levels, through their use of linear interpolation (Jedidi and Zhang 2002).
Overall then, measuring HWTP with direct or indirect methods could evoke the hypothetical bias, and extant evidence is mixed (e.g. Miller et al. 2011), featuring arguments for the superiority of both method types. Therefore, we formulate two competing hypotheses.
H1a: Measuring HWTP with an indirect method leads to a smaller hypothetical bias compared to direct methods.
H1b: Measuring HWTP with a direct method leads to a smaller hypothetical bias compared to indirect methods.

Moderators: research stimulus
When asked for their HWTP, personal budget constraints do not exert an effect, because the consumer does not actually have to pay any money. However, when measuring RWTP, budget constraints limit the amount that participants may contribute (Brown et al. 2003). For low-priced products, this constraint should have little influence on the hypothetical bias, because the RWTP likely falls within this budget. For highpriced products though, budget constraints likely become more relevant; participants might state HWTP estimates that they could not afford in reality, thereby increasing the hypothetical bias. Thus, we hypothesize: H2: The hypothetical bias is greater for products with a higher value.
A classic categorization of consumer goods cites convenience, shopping, and specialty goods, depending on the amount of search and price comparison effort they require (Copeland 1923). Consumers engage in more search effort when they have trouble assessing a product's utility. Hofstetter et al. (2013) in turn show that the hypothetical bias decreases as people gain means to assess a product's utility, and in a parallel finding, Sichtmann et al. (2011) show that higher product involvement reduces the hypothetical bias. That is, higher product involvement likely reduces the need for intensive search effort. Therefore, we hypothesize: H3: The hypothetical bias is least for convenience goods, greater for shopping goods, and greatest for specialty goods.
Consumers face uncertainty about an innovative product's performance and their preferences for it (Hoeffler 2003). According to Sichtmann et al. (2011), stronger consumer preferences lower the hypothetical bias. In contrast, greater uncertainty reduces their ability to assess a product's utility, which increases the hypothetical bias (Hofstetter et al. 2013). Finally, Hofstetter et al. (2013) show that the perceived innovativeness of a product increases the hypothetical bias. Consequently,

H4:
The hypothetical bias is greater for innovations compared to established products.

Moderators: research design
The research design also might influence the hypothetical bias (List and Gallet 2001;Murphy et al. 2005). In particular, the subject design of an experiment determines the results, in the sense that between-subject designs tend to be more conservative (Charness et al. 2012), whereas within-subject designs tend to result in stronger effects (Ariely et al. 2006). Fox and Tversky (1995) identify stronger effects for a within-subject versus between-subject design in the context of ambiguity aversion; Ariely et al. (2006) similarly find such stronger effects for a within-subject design for a study comparing WTP and willingness to accept. According to Frederick and Fischhoff (1998), participants in a within-subject design express greater WTP differences for small versus large quantities of a product than do those in a betweensubject design. Therefore,

H5:
The hypothetical bias is greater for within-subject designs compared with between-subject designs.
Another source of uncertainty pertains to product performance, and it increases when the consumer can only review images (e.g., online) rather than inspect the product itself physically (Dimoka et al. 2012). Consequently, many consumers test products in a store to reduce their uncertainty before buying them online (showrooming) (Gensler et al. 2017). Similarly, consumers' uncertainty might be reduced in a WTP experiment by giving them an opportunity to inspect and test the product before bidding. Bushong et al. (2010) show that participants state a higher RWTP when real products, rather than images, have been displayed. As Hofstetter et al. (2013) note, greater uncertainty increases the hypothetical bias. We hypothesize: H6: Giving participants the opportunity to test a product before bidding reduces the hypothetical bias.
Finally, researchers often motivate participation in an experiment by paying some remuneration or providing an initial balance to bid in an auction. Equipping participants with money might change their RWTP, because they gain an additional budget. They even might consider this additional budget like a coupon, which they add to their original RWTP. Consumers in general overstate their WTP in hypothetical contexts, so providing a participation fee could decrease the hypothetical bias. Yet Hensher (2010) criticizes the use of participation fees, noting that they can bias participants' RWTP.
H7: Providing participants (a) a participation fee or (b) an initial balance decreases the hypothetical bias.

Collection of studies
With our meta-analysis, we aim to generalize empirical findings about the relative accuracy of HWTP measures, so we conducted a search for studies that report ESs of these measures. We used three inclusion criteria. First, the study had to measure consumers' HWTP and RWTP for the same product or service, so that we could determine the hypothetical bias. Second, the research stimulus had to be private goods or services. Third, we included only studies that reported the mean and standard deviation (or values that allow us to compute it) of HWTP and RWTP or for which the authors provided these values at our request.
To identify relevant studies, we applied a keyword search in different established online databases (e.g., Science Direct, EBSCO) and Google Scholar across all research disciplines and years. The keywords included Bwillingness-to-pay,B reservation price,^Bhypothetical bias,^and Bconjoint analysis.^We also conducted a manual search among leading marketing and economics journals. To reduce the risk of a publication bias, we extended our search to the Social Science Research Network, Research Papers in Economics, and the Researchgate network, and we checked for relevant dissertations whose results had not been published in journals. Moreover, we conducted a cross-reference search to find other studies. We contacted authors of studies that did not report all relevant values and asked them for any further relevant studies they might have conducted. Ultimately, we identified 77 studies reported in 47 articles, accounting for 117 ESs and total sample sizes of 24,441 for HWTP and 20,766 for RWTP.

Coding
As mentioned previously and as indicated by Table 2, we classify the moderators into four categories: (1) methods for measuring WTP, (2) research stimulus, (3) general research design of the study, and (4) the publication in which the study appears. In the first category, the main moderator of interest is the type of measurement HWTP, that is, the direct versus indirect measurement of HWTP. Two other moderators deal with RWTP measurement. Type of measurement RWTP similarly distinguishes between direct and indirect measures, whereas incentive compatible reflects the incentive compatibility (or not) of the method.
The second category of moderators, dealing with the research stimulus, includes value, or the mean RWTP for the corresponding product. The experiments in our meta-analysis span different countries and years, so we converted all values into U.S. dollars using the corresponding exchange rates. The variable variance ES captures participants' uncertainty and heterogeneity when evaluating a product. With regard to the products, we checked whether they were described as new to the consumer or innovations, which enabled us to code the innovation moderator. The moderator product/service distinguishes products and services. Finally, the product type moderator requires more subjective judgment. Two independent coders, unaware of the research project, coded product type by using Copeland's (1923) classification of consumer goods according to the search and price comparison effort they require, as convenience goods, shopping goods, or specialty goods. We use an ordinal scale for product type and therefore assessed interrater reliability with a two-way mixed, consistency-based, average-measure intraclass correlation coefficient (ICC) (Hallgren 2012). The resulting ICC of 0.82 is rated as excellent (Cicchetti 1994); the two independent coders agreed on most stimuli. The lack of any substantial measurement error indicates no notable influence on the statistical power of the subsequent analyses (Hallgren 2012). Any inconsistent codes were resolved through discussion between the two coders. We include product type in the analyses with two dummy variables for shopping and specialty goods, and convenience goods are captured by the intercept.
In the third category, we consider moderators that deal with the research design. The type of experiment HWTP and type of experiment RWTP capture whether the studies measure HWTP and RWTP in field or lab experiments, respectively. Experiments conducted during a lecture or class are designated lab experiments. Offline/online HWTP and offline/online RWTP indicate whether the experiment is conducted online or offline; the type of subject design reveals if researchers used a between-or within-subject design. The moderator opportunity to test indicates whether participants could inspect the product in more detail before bidding. Participation fee and initial balance capture whether participants received money for showing up or for spending in the auction, respectively. We identify a student sample when the sample consists of exclusively students; mixed samples are coded as not a student sample. Methods for measuring RWTP often are not self-explanatory, so researchers introduce them to participants, using various types of instruction. We focused on whether incentive compatibility concepts or the dominant bidding strategy were explained, using a moderator introduction of method for RWTP with four values. It equals Bnone^if the method was not introduced, Bexplanation^if the method and its characteristics were explained, Btraining^if mock auctions or questions designed to understand the mechanism occurred before the focal auction took place or questions were asked, and Bnot mentioned^if the study does not indicate whether the method was introduced. With this nominal scale, we include this moderator by using three dummy variables for explanation, training, and not mentioned, while the none category is captured by the intercept. Finally, we include region. Almost all the studies were conducted in North America or Europe; we distinguish North America from Bother countries (mostly Europe).T he fourth category of moderators contains publication characteristics. We checked whether a study underwent a peer review process (peer reviewed), reflected a marketing or economics research domain (discipline), how many citations it

Effect size
To determine the hypothetical bias induced by different methods, we need an ES that represents the difference between obtained values for HWTP and RWTP. When the differences stem from a comparison of a treatment and a control group, standardized mean differences (SMD) are appropriate measures (e.g. Abraham and Hamilton 2018;Scheibehenne et al. 2010). Specifically, to compute SMD, researchers divide the difference in the means of the treatment and the control group by the standard deviation, which helps to control for differences in the scales of the dependent variables in the experiments. Accordingly, it applies to studies that measure the same outcome on different scales (Borenstein et al. 2009, p. 25). In contrast, the ESs in our meta-analysis rely on the same scale; they differ in their position on the scale, because the products evoke different WTP values. In this case, the standard deviation depends on not only the scale range but also many other relevant factors, so the standard deviation should not be used to standardize the outcomes. In addition, as studies may have used alternate experimental designs, different standard deviations could be used across studies, leading to standardized mean differences that are not directly comparable (Morris and DeShon 2002). Rather than the SMD, we therefore use a response ratio to assess ES, because it depends on the group means only. Specifically, the response ratio is the mean outcome in an experimental group divided by that in a corresponding control group, such that it quantifies the percentage of variation between the experimental and control groups (Hedges et al. 1999). Unlike SMD, the response ratio applies when the outcome is measured on a ratio scale with a natural zero point, such as length or money (Borenstein et al. 2009). Accordingly, the response ratio often assesses ES in meta-analyses in ecology domains (Koricheva and Gurevitch 2014), for which many outcomes can be measured on ratio scales. To the best of our knowledge though, the response ratio has not been adopted in meta-analyses in marketing yet. However, it is common practice to specify a multiplicative, instead of a linear, model when assessing the effects of marketing instruments on product sales or other outcomes (Leeflang et al. 2015). Hence, it would be a natural option to use an effect measure representing proportionate changes, instead of additive changes, when deriving empirical generalizations on marketing subjects like response effects to mailing campaigns. For our effort, we define the response ratio as Whether the study was peer reviewed.

Economics
Dummy variable (marketing = 1) Corresponding research discipline

Citations
Metric variable

Number of citations in Google Scholar
Year

Metric variable
Year the study was published Moderators in italics are control variables where μ HWTP and μ RWTP are the means of a study's corresponding HWTP and RWTP values. For three reasons, we run statistical analyses using the natural logarithm of the response ratio as the dependent variable. First, the use of the natural logarithm linearizes the metric, so deviations in the numerator and denominator have the same impact (Hedges et al. 1999). Second, the parameters (β) for the moderating effects in the meta-regression are easy to interpret, as a multiplication factor, by taking the exponent of the estimate (Exp(β)). Most moderators are dummy variables, and a change of the corresponding dummy value results in a change of (Exp(β) − 1) * 100% in the hypothetical bias. However, this point should not be taken to mean that the difference of the hypothetical bias between two conditions of a moderator is Exp(β) − 1 percentage points, because that value depends on the values of other moderators. Third, the distribution of the natural logarithm of response ratios is approximately normally distributed (Hedges et al. 1999). Consequently, we define ES as: Modeling stochastically dependent effect sizes explicitly Most meta-analyses assume the statistical independence of observed ESs, but this assumption only applies to limited cases; often, ESs are stochastically dependent. Two main types of dependencies arise between studies and ESs. First, studies can measure and compare several treatments or variants of a type of treatment against a common control. In our context, for example, a study might measure HWTP with different methods and compare the results to the same RWTP, leading to multiple ESs that correlate because they share the same RWTP. Treating them as independent would erroneously add RWTP to the analysis twice. This type of study is called a multiple-treatment study (Gleser and Olkin 2009). Second, studies can produce several dependent ESs by obtaining more than one measure from each participant. For example, a study might measure HWTP and RWTP for several products from the same sample. The resulting ESs correlate, because they are based on a common subject. This scenario represents a multiple-endpoint study (Gleser and Olkin 2009). There are different approaches for dealing with stochastically dependent ESs, such as ignoring or avoiding dependence, or else modeling dependence stochastically or explicitly (Bijmolt and Pieters 2001;van den Noortgate et al. 2013). In marketing research, it is still common, and also suggested to avoid dependent ESs (Grewal et al. 2017). However, nested data structures and the associated dependent ESs are prominent in marketing research, so Bijmolt and Pieters (2001) suggest using a three-level model to account for dependency, by adding error terms on all levels. In turn, marketing researchers started to model dependence stochastically by applying multi-level regression models (e.g. Abraham and Hamilton 2018;Arts et al. 2011;Babić Rosario et al. 2016;Bijmolt et al. 2005;Edeling and Fischer 2016;Edeling and Himme 2018). However, when additional information about correlations among the ESs are available, it is most accurate to model dependence explicitly by incorporating the dependencies in the covariance matrix at the within-study level (Gleser and Olkin 2009). In contrast to modeling dependence stochastically, the covariances are not estimated but rather are calculated on the basis of the provided information. To the best of our knowledge, this approach has not been applied by metaanalyses in marketing previously.
To model stochastic dependence among ESs explicitly, we follow Kalaian and Raudenbush (1996) and use a multivariate mixed linear model with two levels: a within-studies level and a between-studies level. On the former, we estimate a complete vector of the corresponding K true ESs, α i = (α 1i , … , α Ki ) T , for each study i. However, not every study examines all possible K ESs, so the vector of ES estimates for study i, ES i ¼ ES 1i ; …; ES L i i ð Þ T , contains L i of the total possible K ESs, and by definition, L i ≤ K. That is, K equals the maximum number of dependent ESs in one study (i.e., six in our sample), and every vector ES i contains between one and six estimates. The first-level model regresses α ki on ES i with an indicator variable Z lki , which equals 1 if ES li estimates α ki and 0 otherwise, according to the following linear model: or in matrix notation, The first-level errors e i are assumed to be multivariate normal in their distribution, such that e i~N (0, V i ), where V i is a K i × K i covariance matrix for study i, or the multivariate extension of the V-known model for the meta-regression. The elements of V i must be calculated according to the chosen ES measure (see Web Appendix B; Gleser and Olkin 2009;Lajeunesse 2011). In turn, they form the basis for modeling the dependent ESs appropriately. The vector α i of a study's true ES is estimated by weighted least squares, and each observation is weighted by the inverse of the corresponding covariance matrix (Gleser and Olkin 2009).
The linear model for the second stage is or in matrix notation where the K ESs become the dependent variable. The residuals u ki are assumed to be K-variate normal with zero average and a covariance matrix τ. Then X i reflects the moderator variables. By combining both levels, the resulting model is Estimates for τ are based on restricted maximum likelihood. The analysis uses the metafor package for metaanalyses in R (Viechtbauer 2010).

Data screening and descriptive statistics
One of the criticisms of meta-analyses is the risk of publication bias, such that all the included ESs would reflect the nonrandom sampling procedure. Including unpublished studies can address this concern; in our sample, 22 of 117 ESs come from unpublished studies, for an unpublished work proportion of 19%, which favorably compares with other meta-analyses pertaining to pricing, such as 10% in Tully and Winer (2014), 9% in Bijmolt et al. (2005), or 16% in Abraham and Hamilton (2018). The funnel plot for the sample, as depicted in Fig. 1, is symmetric, which indicates the absence of a publication bias. Finally, as the competing H1a and H1b indicate, we do not expect a strong selection mechanism in research or publication processes that would favor significant or high (or low) ESs. Thus, we do not consider publication bias a serious concern for our study.
To detect outliers in the data, we checked for extreme ESs using the boxplot (see Web Appendix D, Figure WA2). We are especially interested in the moderator type of measurement HWTP, so we computed separate boxplots for the direct and indirect measures of HWTP and thereby identified one observation for each measurement type (indirect Kimenju et al. 2005;direct Neill et al. 1994) for which the ESs (0.9079; 0.9582) exceeded the upper whisker, defined as the 75% quantile plus 1.5 times the box length. Kimenju et al. (2005) report HWTP ($11.68) values from an indirect method that overestimate RWTP ($94.48) by a factor of eight; we excluded it from our analyses. Neill et al. (1994) report HWTP ($109) that overestimates RWTP ($12) by a factor of nine when excluding outliers, and it is another outlier in our database. Thus, we excluded two of 117 observations, or less than 5% of the full sample, which is a reasonable range (Cohen et al. 2003, p. 397

Peer reviewed
No Yes an underestimation of RWTP, resulting from direct (12) and indirect (4) methods. Table 3 contains an overview of the moderators' descriptive statistics. Type of measurement HWTP reveals some mean differences between direct (0.1818) and indirect (0.2280) measures, which represents model-free support for H1b. The descriptive statistics of product type suggest a higher mean ES for specialty goods (0.2911) than convenience (0.1954) or shopping (0.1399) goods, in accordance with H3. With regard to innovation, we find a higher ES mean for innovative (0.2287) compared with non-innovative (0.1760) products, as we predicted in H4. Model-free evidence gathered from the moderators that reflect the research design also supports H5, in that the mean for between-subject designs is lower (0.1800) than that for within-subject designs (0.2798). The descriptive statistics cannot confirm H6 though, because giving participants an opportunity to test a product before stating their WTP increases the ES (0.2525) relatively to no such opportunity (0.1626). We also do not find support for H7 in the model-free evidence, because studies with an initial balance and participation fee report higher ESs than those without.

Results
To address our research questions about the accuracy of WTP measurement methods and the moderators of this performance, we performed several meta-regressions in which we varied the moderating effects included in the models. First, we ran an analysis without any moderators. Second, we ran a meta-regression with all the moderators that met the multicollinearity criteria. Third, we conducted a stepwise analysis, dropping the non-significant moderators one by one.    Significance codes: *** p < 0.01; ** p < 0.05; * p < 0.1 Moderators in italics are control variables The first model, including only the intercept, results in an estimate (β) of 0.1889 with a standard error (SE) of 0.0183 and a p value < .0001. The estimate corresponds to an average hypothetical bias of 20.79% (Exp(0.1889) = 1.2079), meaning that on average, HWTP overestimates RWTP by almost 21%.
The analysis with all the moderators that met the multicollinearity threshold produces the estimation results in Table 4. The type of measurement HWTP has a significant, positive effect (β = 0.1027, Exp(β) = 1.1082, SE = 0.0404, p = 0.0110), indicating that indirect measures overestimate RWTP more than direct measures do. We reject H1a and confirm H1b. In particular, the ratio of HWTP to RWTP should be multiplied by 1.1082, resulting in an overestimation by indirect methods of an additional 10.82%. Value has a significant, positive effect at the 10% level (β = 0.0002, Exp(β) = 1.0002, SE = 0.0001, p = 0.0656), in weak support of H2. The percentage overestimation of RWTP by HWTP increases slightly, by an additional 0.02%, with each additional U.S. dollar increase in value. For H3, we find no significant difference in the hypothetical bias between convenience and shopping goods, yet specialty goods evoke a significantly higher hypothetical bias than convenience goods (β = 0.1615, Exp(β) = 1.1753, SE = 0.0476, p < .0001). This finding implies that the hypothetical bias is greater for products that demand extraordinary search effort, as we predicted in H3. We do not find support for H4, because innovation does not influence the hypothetical bias significantly (β = − 0.0004, Exp(β) = 0.9996, SE = 0.0505, p = 0.9944).
Finally, we ran analyses in which we iteratively excluded moderators until all remaining moderators were significant at the 5% level. We excluded the moderator with the highest p value from the full model, reran the analysis, and repeated this procedure until we had only significant moderators left. We treated the dummy variables from the nominal/ordinal moderators product type and introduction of method for RWTP as belonging together, and we considered these moderators as significant when one of the corresponding dummy variables showed a significant effect. The exclusion order was as follows: innovation, type of experiment RWTP, type of measurement RWTP, opportunity to test, year, variance ES, incentive compatible, initial balance, citations, participation fee, region, value, type of subject design, and offline/online HWTP. The results in Table 4 reconfirm the support for H1b, because the type of measurement HWTP has a positive, significant effect (β = 0.0905, Exp(β) = 1.0947, SE = 0.0382, p = 0.0177), resulting in a multiplication factor of 1.0947. The overestimation of RWTP increases considerably for measures of WTP for specialty goods (β = 0.1624, Exp(β) = 1.1763, SE = 0.0393, p < .0001), in support of H3. Yet we do not find support for any other hypotheses in the reduced model. Notes: The base scenario is as follows: product type = convenience good,introduction of method for RWTP = explanation,student sample = no. Regarding the control variables, student sample (β = − 0.1026, Exp(β) = 0.9025, SE = 0.0344, p = 0.0021) again has a significant effect, and introduction of method for RWTP affects the hypothetical bias significantly. In this case, the hypothetical bias increases when the article does not mention any introduction of the method for measuring RWTP to participants (β = 0.1546, Exp(β) = 1.1672, SE = 0.0524, p = 0.0032) and when the method involves mock auctions (β = 0.2032, Exp(β) = 1.2253, SE = 0.0604, p = 0.0008).
For ease of interpretation, we depict the hypothetical bias for different scenarios in Fig. 2. The reduced model provides a better model fit, according to the corrected Akaike information criterion (AICc) (AICc full model = 45.61, AICc reduced model-= − 23.49), so we use it as the basis for the simulation. The base scenario depicted in Fig. 2 measures WTP for convenience goods, explains the method for measuring RWTP to participants, and does not include solely students. The other scenarios are adaptions of the base scenario, where one of the three aforementioned characteristics is changed. In the base scenario, we predict that direct measurement overestimates RWTP by 9%, and indirect measurement overestimates it by 19%, so the difference is 10 percentage points. In contrast, for specialty goods, the overestimation increases to 28% for direct and to 40% for indirect measures. When using a pure student sample instead of a mixed sample, the predictions are relatively accurate. Here, direct measurement even underestimates RWTP by 2%, while indirect measurement yields an overestimation of 7%. With respect to how the method for measuring RWTP is introduced to the participants, not mentioning it in a paper, as well as training the method beforehand increase the hypothetical bias. While the first option is hardly interpretable, running mock tasks increases the bias to 33% in case of direct and to 46% in case of indirect methods used for measuring HWTP.

Robustness checks
We ran several additional analyses to check the robustness of the results, which we summarize in Table WA2 in Web Appendix F. To start, we analyzed Model 1 in Table WA2 by applying a cut-off value of GVIF 1= 2*df ð Þ < ffiffiffiffiffi 10 p , comparable to the often used cut-off value of 10 for the VIF. In this case, we did not need to exclude any moderator, but the results do not deviate in their signs or significance levels relatively to the main results. Type of measurement HWTP still has a significant effect (5% level) on the hypothetical bias. In addition, value, product type (specialty), and type of subject design exert significant influences. Among the control variables, introduction of method for RWTP (training), introduction of method for RWTP (not mentioned), region, and peer reviewed have significant effects (5% level). The moderators excluded from the main models due to multicollinearity (product/service, type of experiment HWTP, offline/online RWTP, and discipline) do not show significant influences.
Next, we estimated two models with all ESs, including the two outliers, but varied varied the number of included moderators (Models 2 and 3 in Table WA2). The results remain similar to our main findings. Perhaps most important, the type of measurement HWTP has a significant effect on the hypothetical bias, comparable in size to the effect in the main model.
In addition, instead of the multivariate mixed linear model, we used a random-effects, three-level model, such that the ES measures nested within studies with a V-known model at the lowest level (Bijmolt and Pieters 2001;van den Noortgate et al. 2013), which can account for dependence between observations. We estimated the two main models and the three robustness check models with this random-effects three-level model (Models 4-8 in Table WA2). Again, the results do not change substantially, except for value, which becomes significant at the 5% level.
Finally, we tested for possible interaction effects. That is, we took all significant moderators from the full model and tested, for each significant moderator, all possible interactions. The limited number of observations prevented us from simultaneously including all interactions in one model. Therefore, we first estimated separate models for each of the significant moderators from the full model, after dropping moderators due to multicollinearity until all moderators had a GVIF 1/(2 * df) < 2. Then, we estimated an additional extension of the full model by adding all significant interactions that emerged from the previous interaction models. We next reduced that model until all moderators were significant at a 5% level. The resulting model achieved a higher AICc than our main reduced model. Comparing all full models with interactions, the model with the lowest AICc (Burnham and Anderson 2004) did not feature a significant interaction, indicating that the possible interactions are small and do not affect our results. All of these models are available in Web Appendix F.

Theoretical contributions
Though three meta-analyses discussing the hypothetical bias exist (Carson et al. 1996;List and Gallet, 2001;Murphy et al. 2005), this is the first comprehensive study giving marketing managers and scholars advices on how to accurately measure consumers' WTP. In contrast to the existing meta-analyses, we focus on private goods, instead of on public goods, increasing the applicability of our findings within a marketing context. 2 With a meta-analysis of 115 ESs gathered from 77 studies reported in 47 papers, we conclude that HWTP methods tend to overestimate RWTP considerably, by about 21% on average. This hypothetical bias depends on several factors, for which we formulated hypotheses (Table 5) and which we discuss subsequently.
With respect to the method for measuring HWTP, whether direct or indirect, across all the different models, we find strong support for H1b, which states that indirect methods overestimate HWTP more severely than direct methods. This important finding contradicts the prevailing opinion among academic researchers (Breidert et al. 2006) and has not previously been revealed in meta-analyses (Carson et al. 1996;List and Gallet 2001;Murphy et al. 2005). We in turn propose several potential mechanisms that could produce this surprising finding. First, we consider the concept of coherent arbitrariness, as first introduced by Ariely et al. (2003). People facing many consecutive choices tend to base each decision on their previous ones, such that they show stable preferences. However, study participants might make their first decision more or less randomly. Indirect measures require many, consecutive choices, so coherent arbitrariness could arise when using these methods to measure WTP. In that sense, the results of indirect measures indicate stable preferences, but they do not accurately reflect the participants' actual valuation. Second, participants providing indirect measure responses might focus less on the absolute values of an attribute and more on relative values (Drolet et al. 2000). The absolute values of the price attribute are key determinants of WTP, so the hypothetical bias might increase if the design of the choice alternatives does not include correct price levels. A widespread argument for the greater accuracy of indirect methods compared with direct methods asserts they mimic a natural shopping experience (Breidert et al. 2006); our analysis challenges this claim.
In our results related to H2, the p value of the value moderator is slightly greater than 5% in the full model, such that the hypothetical bias appears greater for more valuable products in percentage terms, though the effect is relatively small. Value does not remain in the reduced model, but the significant effect is very consistent across the robustness checks that feature the full model (Table 5). Therefore, our results support H2: The hypothetical bias increases if the value of the products to be evaluated increases. This finding is new, in that neither existing meta-analyses (Carson et al. 1996;List and Gallet 2001;Murphy et al. 2005) nor any primary studies have examined this moderating effect.
We also find support for H3 across all analyzed models. For participants it is harder to evaluate a specialty product's utility than a convenience product's utility; specialty goods often feature a higher degree of complexity or are less familiar to consumers than convenience goods. The greater ability to assess the product's utility reduces the hypothetical bias (Hofstetter et al. 2013), such that our finding of higher overestimation for specialty goods is in line with prior research. Yet we do not find any difference between shopping and convenience goods, prompting us to posit that the hypothetical bias might not be affected by moderate search effort; rather, only products demanding strong search effort increase the hypothetical bias. Existing meta-analyses (Carson et al. 1996;List and Gallet 2001;Murphy et al. 2005) include public goods and do not distinguish among different types of private goods. By showing that the type of a private good influences the hypothetical bias, we add to an understanding of the hypothetical bias in a marketing context that features private goods.
With respect to innovation, we find no support for H4, because the differences between innovations and existing products are small and not significant. This finding contrasts with Hofstetter et al.'s (2013) results. Accordingly, we avoid rejecting the claim that methods for measuring HWTP work as well (or as poorly) for innovations as they do for existing products.
2 Please refer to Web Appendix A for a more detailed discussion of the existing meta-analyses. Initial balance decreases the bias A within-subject research design increases the hypothetical bias, compared with a between-subject design, as we predicted in H5 and in accordance with prior research (Ariely et al. 2006, Fox and Tversky 1995, Frederick and Fischhoff 1998. Yet this finding still seems surprising to some extent. When asking a participant for WTP twice (once hypothetically, once in a real context), the first answer seemingly should serve as an anchor for the second, leading to an assimilation expected to reduce the hypothetical bias. Instead, two similar questions under different conditions appear to evoke a contrast instead of an assimilation effect, and they produce a greater hypothetical bias. Consequently, when designing marketing experiments to investigate the hypothetical bias, researchers should use a between-subject design to prevent the answers from influencing each other. When researching the influence of consumer characteristics on the hypothetical bias though, it would be more appropriate to choose a within-subject design (Hofstetter et al. 2013), though researchers must recognize that the hypothetical bias might be overestimated more severely in this case. Murphy et al. (2005) also distinguish different subject designs in their meta-analysis and find a significant effect, though they use RWTP instead of the difference between HWTP and RWTP as their dependent variable. In this sense, our finding of a moderating role of the study design on the hypothetical bias is new to the literature.
Our results do not support H6; we do not find differences in the hypothetical bias when participants have an opportunity the test a product before stating their WTP or not. Testing a product in advance reduces uncertainty about product performance, and our finding is in contrast with Hofstetter et al.'s (2013) evidence that higher uncertainty increases the hypothetical bias. Note however, that the result by Hofstetter et al.'s (2013) refers to an effect of a consumer characteristic, and might be specific to the examined product, namely digital cameras. Our results are more general across a wide range of product categories and experimental designs. Furthermore, this result on H6 is in line with our findings for H4; both hypotheses rest on the participants' uncertainty about product performance, and we do not find support for either of them.
Finally, neither a participation fee nor initial balance reduce the hypothetical bias significantly, so we find no support for H7a or H7b. Formally, we can only Bnot reject^a null hypothesis of no moderator effect, but these findings suggest that we can dispel fears about influencing WTP results too much by offering participation fees or an initial balance.
In addition to these theoretical insights on WTP measures, we contribute to marketing literature by showing how to model stochastically dependent ESs explicitly when the covariances and variances of the observed ESs are known or can be computed. Moreover, we use (the log of) the response ratios as the ES in our meta-analysis, which has not been done previously in marketing. We provide a detailed rationale for using response ratios and thus offer marketing scholars another ES option to use in their meta-analyses.

Managerial implications
This meta-analysis identifies a substantial hypothetical bias of 21% on average in measures of WTP. Although hypothetically derived WTP estimates are often the best estimates available, managers should realize that they generally overestimate consumers' RWTP and take that bias into account when using HWTP results to develop a pricing strategy or when setting an innovation's launch price. In addition, we detail conditions in which the bias is larger or smaller, and we provide a brief overview of how extensive the expected biases might become. In particular, managers should anticipate a greater hypothetical bias when measuring WTP for products with higher values or for specialty goods. For example, when measuring HWTP for specialty goods, direct methods overestimate it by 28% and indirect methods do so by 40%. These predicted degrees of RWTP overestimation should be used to adjust decisions based on WTP studies in practice.
The study at hand also shows that direct methods result in more accurate estimates of WTP than indirect methods do. Therefore, practitioners can resist, or at least consider with some skepticism, the prevalent academic advice to use indirect methods to measure WTP. In addition to being less accurate, indirect methods require more effort and costs (Leigh et al. 1984). However, this recommendation only applies if the measurement of HWTP is necessary. If RWTP can be measured with an auction format, that option is preferable, since RWTP reflects actual WTP, whereas HWTP tends to overestimate it. This result also implies an exclusive focus on measuring WTP for a specific product, such that it disregards some advantages of the disaggregate information provided by indirect methods (e.g., demand due to cannibalization, brand switching, or market expansion; Jedidi and Jagpal 2009). In summary, the key takeaway for managers who might use direct measures of HWTP is that the Bquick and dirty solution^is only quick, not dirty-or at least, not more dirty than indirect methods.

Limitations and research directions
This meta-analysis suggests several directions for further research, some of which are based on the limitations of our meta-analysis. First, several recent adaptations of indirect methods seek to improve their accuracy (Gensler et al. 2012, Schlereth andSkiera 2017). These improvements might reduce the variance in measurement accuracy between direct and indirect measurements. These recently developed methods have not been tested by empirical comparison studies, so we could not include them in our meta-analysis. An extensive comparison of those adaptions, in terms of their effects on the hypothetical bias, would provide researchers and managers more comprehensive insights for choosing the right method when measuring WTP.
Second, the prevailing opinion of indirect methods yielding a lower hypothetical bias than direct methods bases upon assumptions concerning individuals' decision making; though our results are in contrast with this opinion. The underlying mental processes when asked for the WTP through direct or indirect methods are not well understood yet. Investigating those processes would foster the understanding of differences in the hypothetical bias between direct and indirect methods and between other experimental conditions. This would enable the development of new adaptions minimizing the hypothetical bias.
Third, the hypothetical bias depends on a variety of factors, including individual-level considerations (Hofstetter et al. 2013;Sichtmann et al. 2011), that extend beyond the product or study level moderators as examined in our meta-regressions. Very few studies have investigated these factors, so we could not incorporate them in our meta-analysis, though consumer characteristics likely explain some differences. Therefore, we call for more research on whether and how individual characteristics influence the hypothetical bias. For example, a possible explanation for the limited accuracy of indirect measures could reflect coherent arbitrariness (Ariely et al. 2003). Continued research might examine whether and how coherent arbitrariness affects different consumers, especially in the context of CBCs. In addition, our findings on some product-level factors are new, namely that the hypothetical bias is greater for higher valued products and for specialty goods. These results could be cross-validated in future experimental studies.
Fourth, knowing and measuring WTP is crucial for firms operating in business-to-business (B2B) contexts (Anderson et al. 1992), yet all ESs in our study are from a business-toconsumer context. Because B2B products and services tend to be more complex, customers might prefer to identify product characteristics and to include them separately when determining their WTP in response to an indirect method. However, anecdotal evidence indicates that direct measurement works better for industrial goods than for consumer goods (Dolan and Simon 1996). Researching the differential accuracy of the various methods in a B2B context would be especially interesting; our study already indicates differences between convenience and (more complex) specialty goods. Therefore, we join Lilien (2016) in calling for more research in B2B marketing, including the measurement of WTP.
Fifth, the majority of studies included herein used open questioning as the direct method for measuring WTP. In practice, different direct methods are available (Steiner and Hendus 2012), yet they rarely have been investigated in academic research. Pricing research could increase in managerial relevance (Borah et al. 2018), and help managers make better pricing decisions, if it included assessments of different direct methods for measuring WTP.