1 Introduction

Nearly unlimited choice and pre-purchase uncertainty regarding a product’s quality have sparked considerable interest in developing ways of supporting customers’ decision making in online shopping (e.g., Brynjolfsson et al. 2003; Gill et al. 2012; Dimoka et al. 2012). To reduce uncertainty, retailers provide manufacturer-independent, customer-written reviews on their websites. Customer reviews are trusted (Bickart and Schindler 2001), increase sales and purchase intentions (Berger et al. 2010; Ghose and Ipeirotis 2011; Park et al. 2007; Chen et al. 2010; for a survey, see Dellarocas 2003), increase post-purchase satisfaction (Stephen et al. 2012), reduce return rates (Sahoo et al. 2018) and increase perceived website usefulness (Kumar and Benbasat 2006). Focusing on the book market, Reimers and Waldfogel (2020) estimate the yearly welfare effects of customer reviews at $ 41 million for the U.S. Several companies have even built successful businesses around soliciting, collecting and distributing customer reviews to online retailers.Footnote 1

Customers do not, however, consider all reviews but instead search for high-quality and trustworthy information (Chen et al. 2008). Only high-quality reviews will be perceived as helpful and will affect customers’ purchasing decisions (Pavlou et al. 2007). Hence, a successful review system must foster both the quantity and the quality of reviews. These requirements have led to different incentive schemes, all trying to induce customers to write high-quality reviews. In this paper, we abstract from the quantity question and analyze how different incentives affect the quality of reviews. More precisely, we compare two frequently used incentive schemes: a pay-per-review scheme, in which payment is independent of quality, and a tournament incentive scheme, in which reviewers receive a bonus contingent on the relative quality of the review, as measured by helpfulness ratings assigned by others.

Several retailers run a pay-per-review scheme: customers receive a fixed payment or a product in exchange for a review. Some review platforms (e.g., www.ilovetoreview.com) connect retailers who are willing to give away their products for free in exchange for a review with customers who agree to write a review in return. This can be problematic because, if the costs of writing a review increase with review quality, a money-maximizing customer will write a large number of low-quality reviews.Footnote 2

To generate high-quality reviews, some retailers condition payment on relative review quality. To judge a review’s quality, retailers look at the helpfulness votes assigned by customers. Under Amazon’s incentive program “Vine Club”, a selected group of customers receives pre-release products free of charge in exchange for a review. The specific eligibility criteria for Vine reviewers are not publicized, but Amazon admits that the helpfulness of previously written reviews plays a large role. Essentially, “Vine Club” is a tournament incentive scheme in which the bonus consists of being admitted to and remaining in the Vine program. A similar tournament incentive scheme is Yelp’s Elite Squad program, in which selected reviewers are invited to exclusive events.Footnote 3 A tournament incentive scheme can give rise to strategic downvoting, as the following example illustrates.

Soon after the initiation of Amazon’s Vine program, Vine members started to complain that their reviews were accumulating inexplicably high numbers of negative helpfulness ratings. They suspected that fellow reviewers were systematically “voting down” their reviews to oust and replace them as Vine Club members, or to protect their own membership status.Footnote 4 The implicit assumption here is that a review’s quality is measured in relative terms: by assigning a bad helpfulness rating to a fellow reviewer’s review, a reviewer makes her own review appear better by comparison.Footnote 5 Because the incentive to assign the lowest possible helpfulness rating is strategic (retaining Vine membership), we call such behavior strategic downvoting.Footnote 6

This example reveals a potential weakness of monetary incentives that are based on relative quality, as measured by helpfulness ratings: due to strategic downvoting, helpfulness ratings may be biased. This could discourage reviewers and lead to a decrease in review quantity and quality. It may also imply that a review’s helpfulness rating ceases to signal the review’s quality to other customers. These problems arise not only if most customers are motivated by quality-based monetary incentives; as we will argue in Sect. 5, they may also be relevant if there is a large population of silent customers who never write reviews.

The empirical evidence so far is inconclusive. Closest to our study are Stephen et al. (2012) and Wang et al. (2012). Stephen et al. (2012) show that a flat salary ($ 1 per review) increases the helpfulness of the reviews.Footnote 7 Wang et al. (2012) find no effect of a quality-contingent payment ($ 0.25 per helpfulness point) on review helpfulness. A systematic comparison of a flat salary and a tournament incentive scheme has, to the best of our knowledge, not been conducted. Our paper tries to fill this gap.

Using a controlled laboratory experiment and an online survey, we compare a pay-per-review scheme and a tournament incentive scheme. Participants write product reviews and vote on the helpfulness of others’ reviews. Review writing is implemented as a public good setting in which the whole group benefits from the reviews written. We compare how the different incentive schemes affect review quality and the assignment of helpfulness ratings. The experiment allows us to account for confounding factors such as collusion or social pressure, which are present in the field. Also, in reality, many customers write reviews only when they are extremely happy or frustrated with a product. In the experiment, each participant has to write reviews, which allows us to abstract from the question of who writes a review and to isolate the effect of incentives on review quality. Most importantly, the random assignment to treatments implies that differences in review quality cannot be explained by differences in reviewers’ experience or reliability. This exogenous randomization provides a clear methodological advantage over using field data (Falk and Heckman 2009; Bardsley et al. 2010).

The remainder of the paper is organized as follows. We discuss the theoretical background and derive our hypotheses in Sect. 2. In Sect. 3, we introduce the designs of the experiment and the survey, before we present the results in Sect. 4. In Sect. 5, we discuss our results and the limitations of our research and conclude by deriving implications for managerial practice.

2 Theoretical Background and Hypotheses

2.1 Reviews as a Public Good

Interactions between reviewers share some characteristics with interactions between individuals in a public good game (PGG).Footnote 8 Reviews are non-excludable and non-rivalrous. Each potential customer can access the reviews for free, and a customer who reads a review does not reduce the benefit others can derive from it. Hence, reviews constitute a public good, and the value of this public good increases with the quantity and quality of the reviews.

Writing a review generates private costs and benefits others. Not writing a review and saving these costs is the dominant strategy if a reviewer cares only about her own monetary payoff. If all reviewers follow this strategy, no reviews are written. This constitutes a Nash equilibrium because no individual reviewer can increase her expected payoff by unilaterally writing a review of any quality. Clearly, this equilibrium is inefficient, because no information is shared. To realize the benefits from customer-written reviews, retailers need to motivate their customers to write as many helpful reviews as possible.

2.2 Approval and Disapproval as Nonmonetary Rewards

When modeling review writing as a public good game, theory predicts that no reviews are written if customers only care about their monetary payoff. Obviously, this is at odds with the large number of reviews that are written. One explanation for the positive number of reviews is that review-writers do not care about money alone but also about nonmonetary rewards. More specifically, if helpfulness ratings can be assigned to reviews, these ratings can be used to express approval or disapproval, which are nonmonetary rewards.Footnote 9

This explanation receives support from the public goods literature. The general pattern in laboratory PGGs is that, on average, participants contribute approximately 50% of their endowment and contributions decrease over time (e.g., Zelmer 2003). Contributions can be increased by adding a second stage, in which participants can allocate nonmonetary reward or punishment points to each other (Masclet et al. 2003; Dugar 2013; Greiff and Paetzel 2015). Nonmonetary rewards and punishments are usually modelled as the assignment of approval and disapproval points. In the literature, this is often referred to as the exchange of social approval, peer approval, or the expression of informal sanctions. In the following, we will refer to it as approval or disapproval. A second theoretical argument for the effectiveness of approval and disapproval applies to repeated games: approval and disapproval can serve as pre-play communication for future rounds (Masclet et al. 2003, p. 367). Although this form of pre-play communication is cheap talk, it is well known that cheap talk positively affects contributions (Ledyard 1995; Zelmer 2003).Footnote 10 In our experiment, participants can use helpfulness ratings to express approval or disapproval of the reviews written by other participants. A low rating can signal, e.g., that a review is too short and that the contribution to the public good is insufficient.

Helpfulness ratings provide incentives for review writing even if they do not affect the chances of receiving any future monetary reward. However, if good helpfulness ratings increase the chances of receiving a bonus (as in our bonus treatment), this could lead to crowding-out and strategic downvoting, which we discuss next.

2.3 Crowding-Out

A monetary bonus paid for the best review could increase the quality of reviews. However, introducing such a bonus could come with negative psychological side effects, which may weaken the individual’s intrinsic motivation to engage in the incentivized activity (e.g., Deci et al. 1999; Gneezy et al. 2011). These effects are referred to as crowding-out. Crowding-out occurs because the strength of intrinsic motivation depends on how the activity is perceived. In the remainder of this section, we argue that a change in the incentive scheme will affect this perception and thus intrinsic motivation.Footnote 11

Assuming that reviewers are motivated not only by monetary incentives, we can distinguish between extrinsic and intrinsic motivation. In the case of review writing, extrinsic motivation consists of monetary and nonmonetary rewards (i.e., the helpfulness ratings discussed in the preceding section). In contrast to extrinsic motivation, intrinsic motivation derives from rewards inherent to the activity of review writing.

The strength of intrinsic motivation is not independent from monetary incentives. Consider a setting in which a fixed monetary reward is paid for any review, regardless of the review’s quality. In this setting, writing a high-quality review is likely to result in feelings of generosity and competence. Feelings of self-interest are unlikely because the monetary reward is independent of quality. By writing a better review, the writer provides more information for others but cannot increase her own payment.

This might be different in a setting in which only the best review is rewarded with a monetary bonus. In this setting, writing a high-quality review is less likely to result in feelings of generosity and competence. This is because review writing might now be perceived as driven by pursuit of the bonus. By writing a better review, the writer provides more information for others but at the same time increases her own expected payment. Hence, when quality is incentivized, a high-quality review is more likely to signal the reviewer’s self-interest and can lead to reduced feelings of generosity and competence. Ultimately, this can lead to lower intrinsic motivation as compared to the former setting without a quality-contingent bonus.

2.4 Strategic Downvoting

Whenever helpfulness ratings are assigned solely based on quality, the best review will receive the highest rating. At first glance, making rewards directly dependent on quality appears to be a straightforward and effective idea (Wang et al. 2012). It rests, however, on the assumption that helpfulness ratings are cast honestly.

However, if only the best review is rewarded with a bonus while all other reviews are unpaid, the setting resembles a winner-takes-all tournament. In such a tournament, the winner is determined by relative quality. Because quality is assessed by helpfulness ratings, reviewers have an incentive for strategic downvoting: assigning a lower rating to others is a form of sabotage (see Harbring and Irlenbusch 2011; or section 6.1 in Dechenaux et al. 2015) and increases one’s own chance of getting the highest rating and receiving the bonus.

Strategic downvoting can reduce reviewers’ motivation as well as the signaling power of helpfulness ratings. If helpfulness ratings are a form of approval, as we have argued above, reviewers might expend considerable time and effort on a review because this increases the chances that the review will receive a high helpfulness rating. Strategic downvoting weakens this approval channel because it lowers the correlation between helpfulness ratings and quality.Footnote 12 Reviewers will learn about or anticipate strategic downvoting, and helpfulness ratings will lose their motivating power. If reviewers anticipate strategic downvoting, the quality of reviews might deteriorate because of crowding-out effects.

A related issue is the effect of strategic downvoting on the signaling power of ratings. If customers expect strategic downvoting, they cannot identify the most helpful reviews by looking at the helpfulness ratings. This means that customers may end up basing their decisions on mediocre reviews or may not use the reviews at all.

2.5 Hypotheses

We focus on review systems in which the quality of reviews is assessed endogenously through reviewers’ helpfulness ratings. Our main goal is to investigate the effect of a quality-contingent bonus on the quality of reviews.

Since reviewers’ helpfulness ratings are central to our theoretical reasoning, we start by investigating the effect of a quality-contingent bonus on reviewers’ assignment of helpfulness ratings. Based on the theoretical considerations discussed in Sect. 2.4, we derive the following hypotheses, which we test using data from a controlled laboratory experiment and an online survey. A critical discussion at the end of the paper revisits the assumptions behind our hypotheses.

Hypothesis 1 (H1—Strategic Downvoting)

Incentivizing review quality by introducing a quality-contingent bonus leads to strategic downvoting.

The theoretical background for H1 is based on our discussion of strategic downvoting. If quality is incentivized, reviewers maximize their expected payoff by assigning the lowest helpfulness rating to all other reviews. In order to investigate the effect of a quality-contingent bonus on the quality of reviews, we derive our second hypothesis, H2.

Hypothesis 2 (H2—Crowding-Out)

Incentivizing review quality by introducing a quality-contingent bonus decreases the average quality of reviews.

The theoretical background for H2 is based on our discussion of crowding-out and strategic downvoting and is illustrated in Fig. 1. If the introduction of a quality-contingent bonus leads to crowding-out, intrinsic motivation will decrease and, consequently, the average quality of reviews will decrease. This holds even if helpfulness ratings are assigned honestly.

Fig. 1 Theoretical background for Hypothesis 2

In addition to the crowding-out effect, there could be an effect from strategic downvoting. In the extreme case, all reviewers assign the lowest helpfulness rating to each review. This implies that helpfulness ratings are not used to express approval or disapproval, so that the extrinsic incentives from helpfulness ratings are absent. More importantly, strategic downvoting implies that nobody can increase her probability of winning the bonus, because helpfulness ratings are then statistically independent of review quality and the winner is decided by chance alone.Footnote 13 Overall, both effects reduce review quality.

3 Experiment and Survey

3.1 Experimental Design

Each participant plays within a group of 5 players for 4 rounds. Each round consists of two stages (see Fig. 2). Group composition remains constant across rounds.

Fig. 2 Experimental design

In the first stage of each round, participants receive a product and are given the opportunity to sample it. All participants receive the same product, and, in each round, they receive a new product. Products are shown in Fig. 3.

Fig. 3 Sample products in the experiment (1 hand balm, 2 toothbrush, 3 key ring, 4 pocket calculator; prices are displayed below the items)

Products are distributed at the beginning of each round and collected after each round. We use inexpensive everyday experience products in order to (i) minimize the effect of differences in participants’ product(-type) expertise and (ii) ensure that participants can quickly familiarize themselves with the products. Products are always presented in the same order to avoid order-related confounding. We supplied participants with real products and designed the computer interface to resemble existing websites where customers review products, so that participants could be expected to be familiar with the environment. Also, the instructions were framed to put participants in the shoes of real reviewers.

After sampling the product (without a time limit), participants have to write a product review of 0–400 characters. In addition to the cognitive effort, writing the review is costly in terms of money: for each character written, a participant’s payoff is reduced by € 0.004. If a participant writes a review of maximum length (400 characters), € 1.60 is deducted from her endowment. To avoid losses, participants are endowed with € 1.60 per round.

Implementing a fixed cost per character increases the opportunity cost of review writing. This feature of the design ensures that writing a review is costly, even if the costs of real effort (which are unobservable) are close to zero. Moreover, it ensures that participants have no incentive to pad reviews with uninformative text.

Providing reviews generates a benefit for others. We proxy this positive externality by the number of characters. To account for the positive externality, a participant’s payoff increases by € 0.005 for each character written by another group member. Note that participants only profit from reviews written by others. Thus, participant i’s payoff from stage 1 is given by:

$$\pi _{i}=1.60-0.004 c_{i}+0.005 c_{-i},$$

where \(c_{i}\) denotes the number of characters written by participant \(i\) and \(c_{-i}\) denotes the total number of characters written by all other participants \(-i\) in the same group. This incentive scheme resembles a public goods dilemma. In the standard PGG, the public good is given by the sum of all participants’ contributions multiplied by a positive constant (the so-called marginal per capita return). In our setting, a participant’s own contribution does not enter her own payoff, i.e., it affects only the other participants’ payoffs. As in a PGG, the social benefit of a contribution (€ 0.005 for each other group member) exceeds the private benefit.
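
For illustration, a minimal Python sketch of the stage-1 payoff rule (the function interface and the example character counts are hypothetical; the parameters follow the formula above):

def stage1_payoff(own_chars, others_chars, endowment=1.60,
                  cost_per_char=0.004, benefit_per_char=0.005):
    """Stage-1 payoff in EUR: endowment minus the cost of own characters
    plus the benefit derived from the characters written by the other group members."""
    return endowment - cost_per_char * own_chars + benefit_per_char * sum(others_chars)

# Hypothetical example: a participant writes 400 characters, the four other
# group members write 300, 250, 100 and 0 characters.
print(stage1_payoff(400, [300, 250, 100, 0]))  # 1.60 - 1.60 + 3.25 = 3.25 (up to rounding)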

In addition to writing the review, we gathered information about participants’ product-specific preferences. Participants had to evaluate the product (numerical product score from 1 = very good to 6 = very bad) and to state their willingness to pay (WTP). If the WTP exceeded the price of the product (which was unknown to participants), the participant had to buy the product at that price.Footnote 14 The fact that participants effectively purchased the product (if WTP > price) increases the experiment’s realism and external validity. WTP and product score remained private information.

In the second stage, participants were presented with the reviews of all other participants in their group and asked to rate each review’s helpfulness on a five-point scale (5 = very helpful, 1 = not helpful at all).Footnote 15 So, in our experiment, all participants act both as reviewers and as readers (we discuss this in Sect. 5). We used different scales and input formats for product evaluation in stage 1 and helpfulness rating in stage 2 to avoid confusion among participants.

Using a between-subjects design, we compare behavior across two treatments; participants’ stage-2 payoff is treatment-specific. In our first treatment, each participant receives a flat salary of € 1 (flat wage treatment, FWT).Footnote 16 In the second treatment, the reviewer with the highest average helpfulness rating receives a bonus of € 5 while all other group members receive no payment (bonus treatment, BT). If several reviews attain the same highest average helpfulness rating, the bonus is split evenly among these reviewers. At the end of each round, each participant is informed about her payoff and the average helpfulness rating of her review.
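
A minimal sketch of the treatment-specific stage-2 payoff rule, including the even split of the € 5 bonus among tied top-rated reviewers in BT (the function name and interface are hypothetical):

def stage2_payoffs(avg_ratings, treatment):
    """Stage-2 payoffs in EUR for one group, given each member's average helpfulness rating.
    FWT: flat wage of EUR 1 per review.
    BT: EUR 5 bonus for the highest average rating, split evenly in case of a tie."""
    if treatment == "FWT":
        return [1.0] * len(avg_ratings)
    best = max(avg_ratings)
    winners = [i for i, r in enumerate(avg_ratings) if r == best]
    return [5.0 / len(winners) if i in winners else 0.0 for i in range(len(avg_ratings))]

# Hypothetical example with two tied top-rated reviews in BT:
print(stage2_payoffs([3.5, 3.5, 2.0, 1.5, 1.0], "BT"))  # [2.5, 2.5, 0.0, 0.0, 0.0]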

The total payoff of each round is given by the sum of payoffs from each stage. That is, total payoff is composed of (i) the payoff \(\pi _{i}\) from stage 1 (see the formula above) minus the price of the product if the participant bought the product, and (ii) the payoff from stage 2.

The Nash equilibrium for this game would be to assign the lowest helpfulness rating to every review in BT, while participants would be indifferent about this decision in FWT. In BT, all reviews would then carry the lowest helpfulness rating, so that the bonus would be split equally among the group of five, yielding the same stage-2 payoff as in FWT. Knowing that the length of the review does not influence the chance of getting the bonus would induce participants to write no review in the first stage. As the game is finite, backward induction extends this result from the last round to all previous ones. In this case, participants would earn € 1.60 from stage 1 and € 1 from stage 2, yielding a total payoff of € 2.60 per round (minus the expenditure for buying products).

After the fourth round, participants were asked to complete a questionnaire on demographic variables, product-reviewing experience and review usage. The total payoff earned in the experiment is given by the sum of all rounds’ payoffs.

3.2 Survey Design

According to H1, the helpfulness ratings expressed by participants in treatment BT may be biased downward due to strategic downvoting. In order to gather unbiased helpfulness ratings, we complemented the experiment with a survey whose participants had not taken part in the experiment.

In the survey, participants were asked to rate the helpfulness of reviews written in the experiment. More precisely, we randomly selected the reviews of 20 participants from the experiment (10 from each treatment). The 80 reviews written by these experiment participants (4 reviews written by each participant) were then rated by the survey participants.Footnote 17

3.3 Experimental Procedure

The experiment was conducted at the Passau University Experimental Laboratory (PAULA) using classEx (Giamattei and Lambsdorff 2019). Upon arrival, participants were randomly seated in the laboratory and given detailed experimental instructions.Footnote 18 A pre-test and several control questions ensured that participants understood the instructions correctly.Footnote 19

We conducted 6 sessions with 90 participantsFootnote 20 in 18 groups (8 FWT; 10 BT). 74% of the participants were female, which resembles the composition of the student population in Passau. 13% studied an economics-related major, and the average age was 23.3 years. The experiment lasted 91.24 min on average (sd 16.19). The average payoff of € 14.37 (including the show-up fee of € 2) for around 90 min of work was slightly above the average student wage in Passau at that time.

In the survey, we ensured that the 96 survey participants had not taken part in the experiment. Each participant rated up to 20 reviews.Footnote 21 As we recruited students from large lectures, we awarded three gift vouchers with a value of € 10 each. Participants needed 10 min on average to complete the survey. In sum, the survey resulted in 1781 helpfulness ratings: 420 for reviews from FWT and 1361 for reviews from BT. On average, each review from FWT (BT) was rated by about 10 (34) survey participants. We collected more helpfulness ratings for BT because we expected downvoting to be more prominent in this treatment. For our statistical analysis, we use the average helpfulness ratings from survey participants for the 80 reviews (40 from each treatment, see Sect. 3.2).

4 Results

4.1 Strategic Downvoting and Review Quality

The analysis in this section focuses on three variables: review length as a proxy for qualityFootnote 22, the helpfulness ratings from the experiment, and the helpfulness ratings from the survey.

Fig. 4 summarizes the pooled data. We have four observations per participant (one observation for each round), resulting in 4 × 40 = 160 observations in FWT and 4 × 49 = 196 observations in BT. In BT, reviews are longer but receive lower helpfulness ratings. The presumably unbiased helpfulness ratings from the survey do not indicate that BT reviews are less helpful. This is a first indication of strategic downvoting in the experiment.

Fig. 4 Mean values for review length, helpfulness rating from the experiment, and helpfulness rating from the survey (P-values in graphs are from two-sided Mann-Whitney tests. Data are shown for all four rounds)

First, we compare helpfulness ratings in the experiment. In BT, the mean helpfulness rating is lower (2.02 compared to 2.91 in FWT; two-sided Mann-Whitney test, z = 8.001, p = 0.0000).Footnote 23

The comparison of these means, however, does not take the quality of the reviews into account. Lower helpfulness ratings in BT might be justified if review quality were lower. This is unlikely because reviews in BT are significantly longer (216 characters in FWT and 302 in BT; two-sided Mann-Whitney test, z = −3.202, p = 0.0014). More importantly, we can confront this with the data from the survey. For each treatment, we have 40 randomly selected reviews that were evaluated by external survey participants (see Sect. 3.2). Judged by the unbiased ratings from the survey, average review quality is higher in BT (3.17 in FWT and 3.56 in BT; two-sided Mann-Whitney test, z = −2.156, p = 0.0311). Hence, we can exclude the possibility that the lower helpfulness ratings in BT are driven by lower review quality.
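
As an illustration of the test used for these comparisons, a sketch based on SciPy (the arrays are hypothetical placeholders for the pooled per-review observations; SciPy reports the U statistic rather than the z values given above):

import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical placeholder data standing in for the pooled observations per treatment;
# in the actual analysis these would be the helpfulness ratings (or review lengths).
rng = np.random.default_rng(0)
ratings_fwt = rng.integers(1, 6, size=160)  # 160 pooled observations in FWT
ratings_bt = rng.integers(1, 6, size=196)   # 196 pooled observations in BT

# Two-sided Mann-Whitney test, as used for the treatment comparisons reported above.
u_stat, p_value = mannwhitneyu(ratings_fwt, ratings_bt, alternative="two-sided")
print(u_stat, p_value)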

4.1.1 Result 1

There is strategic downvoting in treatment BT. We find clear evidence for Hypothesis H1.

Fig. 5 illustrates Result 1 graphically by plotting the mean helpfulness ratings each review received in the experiment against the rating it received in the survey. Points above (below) the 45-degree line indicate reviews for which the helpfulness rating in the survey was lower (higher) than the helpfulness rating in the experiment. In FWT, all points are distributed around the 45-degree line: 18 points lie above, 19 below, and 3 exactly on the 45-degree line. This is very different in BT, where only 3 points lie above, and 37 points lie below the 45-degree line.

Fig. 5 Average helpfulness ratings in the experiment and in the survey by treatment (Each review is represented as a point. The point’s y-coordinate is given by the mean of all average helpfulness ratings received by the specific review in the experiment. A point’s x-coordinate is given by the average of all helpfulness ratings received by the specific review in the survey. Horizontal and vertical lines represent mean values based on the average helpfulness ratings assigned to all reviews that were rated both by participants of the experiment and by survey participants)
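
The classification of points relative to the 45-degree line can be reproduced in a few lines (the rating pairs below are hypothetical placeholders):

# Hypothetical per-review average ratings: y-axis = experiment, x-axis = survey.
exp_ratings = [2.5, 3.0, 1.8, 3.2, 2.0]
survey_ratings = [3.1, 3.0, 3.4, 2.9, 3.6]

pairs = list(zip(exp_ratings, survey_ratings))
above = sum(e > s for e, s in pairs)    # survey rating lower than experiment rating
below = sum(e < s for e, s in pairs)    # survey rating higher than experiment rating
on_line = sum(e == s for e, s in pairs)
print(above, below, on_line)            # here: 1 above, 3 below, 1 on the line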

To further validate Result 1, we run a linear regressionFootnote 24 of the helpfulness rating on review length, an indicator variable for BT, and the interaction between both independent variables (see the first column of Table 1). Review length has a positive effect on the helpfulness rating. The coefficient of the indicator variable for treatment BT is negative and significant. The coefficient of the interaction term (BT * review length) is also negative and significant. Hence, an increase in review length leads to an increase in the helpfulness rating, but the increase is smaller in BT. In the second regression in Table 1, we add controls for rounds (taking round 1 as the baseline). Note that the indicator variables for rounds capture not only the effect of learning but also product characteristics, as participants reviewed a different product in each round. Our results are robust to the inclusion of round effects.

Table 1 OLS regressions with average helpfulness rating as dependent variable
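
A sketch of the regression specifications in the two columns of Table 1, using statsmodels formulas (the data frame and the column names avg_helpfulness, review_length, bt and round_no are hypothetical placeholders):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical placeholder data: one row per review with its average helpfulness rating,
# its length in characters, a treatment dummy bt (1 = BT), and the round number.
rng = np.random.default_rng(1)
n = 356  # 160 FWT + 196 BT pooled observations
df = pd.DataFrame({
    "bt": np.repeat([0, 1], [160, 196]),
    "round_no": np.tile([1, 2, 3, 4], 89),
    "review_length": rng.integers(0, 401, size=n),
})
df["avg_helpfulness"] = rng.uniform(1, 5, size=n)  # placeholder ratings

# Column 1 of Table 1: helpfulness on review length, a BT dummy, and their interaction.
m1 = smf.ols("avg_helpfulness ~ review_length * bt", data=df).fit()

# Column 2: round dummies added (round 1 as the baseline).
m2 = smf.ols("avg_helpfulness ~ review_length * bt + C(round_no)", data=df).fit()
print(m1.params, m2.params, sep="\n")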

The evidence described in the previous paragraphs indicates that helpfulness ratings differ between treatments. Helpfulness ratings expressed by participants in BT are clearly biased downwards. Hence, they do not reflect the true underlying quality but are driven by strategic downvoting.Footnote 25

Having found clear support for H1 (strategic downvoting), we next analyze the evidence related to the quality of the reviews. The comparison of helpfulness ratings from the survey reveals that review quality is higher in BT (see above). Contrary to our expectations, the data does not provide support for H2.

4.1.2 Result 2

Despite strategic downvoting, the average quality of reviews is higher in treatment BT.

If the quality of reviews is not affected by strategic downvoting, the question remains which factors do influence the quality of reviews. To shed some light on this question, we performed Tobit regressions with review length as the dependent variable (see Table 2).

Table 2 Tobit regressions with review length as the dependent variable (review length is bounded by the interval [0, 400])
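
Common Python libraries do not ship a ready-made Tobit estimator, so the following sketch spells out the two-limit censored-normal likelihood behind such a regression (the bounds follow the interval above; the function and variable names are hypothetical, and the estimation reported in Table 2 may rely on different software):

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def tobit_negloglik(params, X, y, lower=0.0, upper=400.0):
    """Negative log-likelihood of a two-limit Tobit model for review length,
    which is bounded by the interval [0, 400] characters."""
    beta, sigma = params[:-1], np.exp(params[-1])  # log-sigma ensures sigma > 0
    xb = X @ beta
    lo, hi = y <= lower, y >= upper
    mid = ~(lo | hi)
    ll = np.empty_like(y, dtype=float)
    ll[mid] = norm.logpdf(y[mid], loc=xb[mid], scale=sigma)  # interior observations
    ll[lo] = norm.logcdf((lower - xb[lo]) / sigma)           # piled up at 0 characters
    ll[hi] = norm.logsf((upper - xb[hi]) / sigma)            # piled up at 400 characters
    return -ll.sum()

# Hypothetical usage: X is a design matrix (constant, lagged helpfulness rating, treatment
# dummy, round dummies, controls), y is review length in characters.
# start = np.zeros(X.shape[1] + 1)
# fit = minimize(tobit_negloglik, start, args=(X, y), method="BFGS")
# beta_hat, sigma_hat = fit.x[:-1], np.exp(fit.x[-1])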

Higher helpfulness ratings in a given round increase review quality in the following round. This positive correlation indicates that helpfulness ratings may act as approval, similar to Masclet et al. (2003), Dugar (2013) and Greiff and Paetzel (2015).

A similar effect arises due to the dynamics within a group. If all other group members provide high-quality reviews, participants also increase the quality of their reviews. These self-reinforcing effects are similar to the coordinating effect of high contributions often observed in public good games (e.g., Weimann 1994).

We also examined the influence of winning the bonus on review behavior in the round following the win. Winning the bonus had a negative but non-significant effect on review quality.

We included controls for the product score and the willingness to pay (WTP). Only in FWT does the score have a weakly significant effect, indicating that, in this treatment, participants who perceive the product as better tend to write longer reviews.

Controlling for gender and economics-related major showed that reviews written by female participants were longer in FWT but not in BT, and that participants with an economics-related major wrote shorter reviews in FWT. These effects, however, are only weakly significant.

4.2 Behavior Over Time

Fig. 6 depicts our three main variables by round. The figure provides support for an additional result, Result 3.

Fig. 6 Mean values and 95% confidence intervals by round

4.2.1 Result 3

Over time, review quality decreases in FWT but not in BT. Strategic downvoting becomes more severe over time.

As in repeated public goods games, we observe in FWT that the number of characters written decreases over time. In BT, the length of reviews does not decrease but stays constant. The existence of the bonus seems to prevent a decrease in review quality. Participants do not shy away from writing long, high-quality reviews despite their reviews being voted down strategically. Learning does not play a role here, as we find such long reviews in BT up to round 4. In Table 2, we report dummies for the later rounds. Only for FWT are they negative overall, and one of them is significant, while for BT they are not significant at all. This supports the pattern shown in Fig. 6.Footnote 26

The pattern is mirrored in the helpfulness ratings from the survey: in FWT, these ratings show a downward trend, while in BT they stay constant. In contrast, the helpfulness ratings from the experiment show a very different pattern. While in FWT they track the length of the reviews, in BT they decrease sharply over time even though review length remains high and constant, which again confirms our hypothesis on strategic downvoting.

5 Discussion and Conclusion

Both quantity and quality of customer-written product reviews have positive impacts on purchase intentions, sales, customer satisfaction, and welfare. Retailers and online platforms use different incentives to increase review quantity and quality. In our experiment, we focus on review quality and compare two different incentive schemes. Under the first incentive scheme, reviewers receive a flat salary per review, independent of the review’s quality. Under the second incentive scheme, only the reviewer who wrote the highest-quality review receives a bonus.

Under both incentive schemes, helpfulness ratings are assigned by the other reviewers. Theory predicts that the bonus will lead to strategic downvoting. Reviewers will assign low helpfulness ratings to reviews written by others, because this maximizes the chances of winning the bonus. If reviewers anticipate strategic downvoting, the quality of reviews might deteriorate because of crowding-out effects and because helpfulness ratings do not express approval.

Our data shows that the quality-contingent bonus indeed leads to strategic downvoting. Although the data provides clear evidence for strategic downvoting, the bonus does not have a negative effect on review quality. Review quality remains constant in the presence of the bonus scheme but decreases over time when reviewers receive a flat salary.

We chose our two incentive schemes such that the expected monetary payoffs are identical. This allows us to rule out differences in expected payoffs as an explanation for the observed differences in review quality. Given that most retailers do not reward reviews with a fixed payment and most reviewers are paid nothing, we expect the difference in review quality to be even larger in the real world.Footnote 27
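
To make this explicit: assuming each of the five group members is ex ante equally likely to win the bonus, the expected stage-2 payoff per round is the same in both treatments,

$$E\big[\text{pay}_{2}^{\mathrm{BT}}\big]=\tfrac{1}{5}\cdot 5=1=\text{pay}_{2}^{\mathrm{FWT}}\quad\text{(in €)}.$$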

In our experiment, all participants write reviews and rate the quality of others’ reviews. In reality, there are four mutually exclusive roles: customers who write reviews and vote on the quality of reviews (“reviewers”, as in our experiment), customers who write reviews but do not vote on the quality of reviews, customers who do not write reviews but vote on the quality of reviews (“voters”), and customers who neither write reviews nor judge the reviews’ quality (“silent customers”). Only “reviewers” have an incentive for strategic downvoting, and they are a minority, which may raise the concern that this changes our predictions: if the majority of votes are cast non-strategically, the impact of strategic downvoting may be quite small and strategic downvoting might not be a problem. However, this is not the case. Consider two reviews written by “reviewers” who compete for a bonus given to the most helpful review. Assuming that the votes cast by “voters” are unbiased, both reviews will receive the same number of positive and negative helpfulness ratings from “voters”. The helpfulness ratings assigned by the “reviewers” themselves will then be decisive for determining who gets the bonus. Hence, strategic downvoting can be a problem in reality, even though “reviewers” are a minority.

Our results have direct managerial implications for retailers. First, tournament incentive schemes have no adverse effects on review quality. This suggests that these incentive schemes increase the amount of pre-purchase information. Because only a small number of products receive professional reviews (e.g., reviews in big newspapers) while a much larger number of products are reviewed by customers, employing tournament incentive schemes to generate pre-purchase information is advisable, especially for retailers selling niche products. In addition, our results show that reviewers maintain high quality even over several rounds. Note, however, that in reality a tournament incentive scheme could reduce the quantity of reviews written (which was fixed in our experiment), so that there is a tradeoff between the positive effect on quality and the negative effect on quantity.

Second, when determining which reviewers will receive a bonus, retailers have to take into account that helpfulness ratings could be biased, and it is desirable to mitigate this bias. Instead of using the arithmetic average for aggregating helpfulness ratings, retailers could switch to alternative approaches that assign less weight to helpfulness ratings from reviewers who are likely to be motivated to downvote other reviewers. Possibly, one could employ statistical methods or machine learning to estimate the size of the bias and then use this estimate to devise an aggregation procedure that extracts as much information as possible from the helpfulness ratings (see also Dai et al. 2018).
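
As an illustration of such an aggregation procedure, the following sketch computes a weighted average that downweights helpfulness votes cast by competing reviewers (the weights are arbitrary placeholders and would have to be estimated in practice, e.g., from the size of the downvoting bias):

def aggregate_helpfulness(votes, w_competing=0.3, w_other=1.0):
    """Weighted average helpfulness rating for a single review.

    votes: list of (rating, cast_by_competing_reviewer) pairs; the flag marks votes cast
    by reviewers competing for the same bonus, who may downvote strategically.
    The weights are placeholders and would have to be estimated in practice."""
    if not votes:
        return None
    weights = [w_competing if competing else w_other for _, competing in votes]
    total = sum(w * rating for w, (rating, _) in zip(weights, votes))
    return total / sum(weights)

# Hypothetical example: three votes from ordinary customers, two strategic downvotes
# from competing reviewers; the weighted average is 3.78 instead of the plain mean of 3.0.
print(round(aggregate_helpfulness([(4, False), (5, False), (4, False), (1, True), (1, True)]), 2))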

Third, a closely related problem is the loss in signaling power that arises from strategic downvoting. If customers cannot rely on helpfulness ratings to help them find high-quality reviews, their search and evaluation costs will increase once they realize this bias; if they do not, they base their decisions on inferior reviews and may regret their purchases. Retailers can counteract the loss in signaling power because only reviewers have an incentive for strategic downvoting; “voters”, who never write reviews, have no incentive to do so. By focusing on the helpfulness ratings assigned by these customers, retailers can identify the reviews that are most helpful (based on unbiased ratings). In contrast to the machine-learning approach above, the disadvantage is that votes are discarded, which is especially problematic when new products are launched and only few votes have been gathered. Another option is therefore not to exclude helpfulness ratings cast by other reviewers but to mark them as such.

A fourth implication concerns the relation between the problem of obtaining high-quality reviews and the provision of public goods. In both cases, the benefits are publicly available while the costs are private. Our study indicates that a monetary bonus given to the participant who made the highest contribution increases efficiency, even though the “best” contributor is determined endogenously, which could give rise to strategic downvoting. However, caution should be exercised when generalizing the results of our study to other public-good-style situations. We have analyzed a market for reviews in which each and every review receives exactly the same number of helpfulness ratings from all other participants. Moreover, there are no opportunity costs of assigning helpfulness ratings. Because of these differences, our study might not adequately capture many features of “real-world” public-good-style situations. It would be worthwhile to analyze how the presence of opportunity costs affects the assignment of helpfulness ratings and, consequently, contributions. A further limitation of this study is the short-run examination of review behavior: it remains unclear whether a bonus still has no negative effect on review quality when review behavior is observed over an extended period of time. These aspects are beyond the scope of our paper, which had the more modest goal of identifying whether a monetary bonus affects the assignment of helpfulness ratings and participants’ review-writing behavior.

The results derived from this study open up new and interesting questions for future research on how different incentive schemes affect review quality. Future research could use more complex experimental designs to analyze the changes discussed in the previous paragraph. In addition, field data could be used to shed light on the size of the downvoting bias in existing review systems, and future work could develop and test alternative helpfulness-based incentives that do not give rise to strategic downvoting.