The Market for Reviews: Strategic Behavior of Online Product Reviewers with Monetary Incentives

Customer reviews reduce search costs and uncertainty about a product's quality. Hence, the quantity and quality of reviews have positive impacts on purchase intentions, sales, and customer satisfaction. To increase review quality, retailers and online platforms employ different monetary incentives. We experimentally compare two incentive schemes: one in which reviewers receive a flat salary, independent of review quality, and one in which the reviewer who wrote the highest-quality review receives a bonus payment. Under both incentive schemes, review quality is determined through helpfulness ratings, which are assigned by the other reviewers. Adverse consequences arise under the bonus scheme: strategic considerations give rise to strategic downvoting, whereby reviewers assign low helpfulness ratings to others' reviews in order to maximize their expected payoffs. Review-writing behavior remains unaffected: quality-contingent bonus payments do not affect review quality. However, they do destroy the signaling power of helpfulness ratings.


Introduction
Nearly unlimited choice and pre-purchase uncertainty regarding a product's quality have sparked considerable interest in developing ways of supporting customers' decision making in online shopping (e.g., Brynjolfsson, Hu and Smith 2003, Gill et al. 2012, Dimoka et al. 2012). To reduce uncertainty, retailers provide manufacturer-independent, customer-written reviews on their websites. Customer reviews are trusted (Bickart and Schindler 2001), increase sales and purchase intentions (Berger et al. 2010; Ghose and Ipeirotis 2011; Park et al. 2007; Chen et al. 2010; for a survey, see Dellarocas 2003), increase post-purchase satisfaction (Stephen et al. 2012), reduce return rates (Sahoo et al. 2018), and increase perceived website usefulness (Kumar and Benbasat 2006). Several companies have even built successful businesses around soliciting, collecting, and distributing customer reviews to online retailers. 1 Customers do not, however, consider all reviews but instead search for high-quality and trustworthy information (Chen et al. 2008). Only high-quality reviews will be perceived as helpful and will affect customers' purchasing decisions (Pavlou et al. 2007). Hence, any successful review system requires that enough high-quality reviews are available for each product. This requirement has led to different incentive schemes, all trying to induce customers to write as many high-quality reviews as possible. In this paper, we compare two frequently used incentive schemes: a pay-per-review scheme, in which payment is independent of quality, and a scheme in which reviewers receive a quality-contingent bonus.
Several retailers run pay-per-review schemes: customers receive a fixed payment or a product in exchange for a review. Some review platforms (e.g., www.ilovetoreview.com) connect retailers, who are willing to give away their products for free in exchange for a review, with customers, who agree to write a review in exchange for the product. This could be problematic because, if the costs of writing a review increase with review quality, a money-maximizing customer will write a large number of low-quality reviews.
To generate high-quality reviews, some retailers condition payment on review quality. In order to judge a review's quality, retailers look at the helpfulness votes, which are assigned by customers. In its incentive program "Vine Club", a selected group of Amazon customers receive pre-release products free of charge in exchange for a review. The specific eligibility criteria for Vine reviewers are not publicized, but Amazon admits that the helpfulness of previously written reviews plays a large role. Essentially, "Vine Club" is a bonus scheme in which the bonus consists of being admitted to and remaining in the Vine program. Such an incentive scheme can give rise to strategic downvoting, as the following example illustrates.
Soon after the initiation of Amazon's Vine program, Vine members started to complain that their reviews were accumulating inexplicably high numbers of negative helpfulness ratings. They suspected that fellow reviewers were systematically "voting down" their reviews to oust and replace them as Vine Club members, or to protect their own membership status. 2 Here, an implicit assumption is that a review's quality is measured in relative terms. By assigning a bad helpfulness rating to a fellow's review, the reviewer's own review will be perceived as better relative to the fellow's review. 3 Because the incentive to assign the lowest possible helpfulness rating is strategic (retaining Vine membership), we call such behavior strategic downvoting. 4 This example reveals a potential weakness of quality-based monetary incentives: due to strategic downvoting, helpfulness ratings may be biased. This could discourage reviewers and lead to a decrease in review quantity and quality. It may also imply that a review's helpfulness rating ceases to signal the review's quality to other customers. These problems exist not only if most customers are motivated by quality-based monetary incentives; they may also be relevant if there is a large population of silent customers who never write reviews, as we will argue in section 5.
The empirical evidence so far is inconclusive. Closest to our study are Stephen et al. (2012) and Wang et al. (2012). Stephen et al. (2012) show that a flat salary ($ 1 per review) increases the helpfulness of the reviews. 5 Wang et al. (2012) find no effect of a quality-contingent payment ($ 0.25 per helpfulness point) on review helpfulness. A systematic comparison of a flat salary and a bonus incentive scheme has, to the best of our knowledge, not been conducted. Our paper tries to fill this gap.
Using a controlled laboratory experiment and an online survey, we compare a pay-per-review scheme and a quality-contingent bonus scheme. Participants write product reviews and vote on the helpfulness of others' reviews. Writing reviews is implemented as a public good setting in which the whole group profits from written reviews. We compare how the different incentive schemes affect review quality and the assignment of helpfulness ratings. The experiment allows us to account for confounding factors such as collusion or social pressure, which are present in the field. Most importantly, the random assignment to treatments implies that differences in review quality cannot be explained by differences in reviewers' experience or reliability. The exogenous randomization in experimental settings provides a clear methodological advantage over using field data (Falk and Heckman 2009, Bardsley et al. 2010).
The remainder of the paper is organized as follows. We discuss the theoretical background and derive our hypotheses in section 2. In section 3, we introduce the designs of the experiment and the survey, before we present the results in section 4. In section 5, we discuss our results and the limitations of our research. Finally, we conclude in section 6 by deriving implications for managerial practice.

Review Quality as a Public Good
Interactions between reviewers share some characteristics with interactions between individuals in a public good game (PGG). In a typical PGG, each player contributes private resources to a public good, which benefits the group. For each player, the cost of a contribution exceeds her private benefit, but the benefit for all group members exceeds her cost. Players maximize their own payoff by contributing zero. However, each group member would receive a higher payoff if every player made a positive contribution. 6 In a group of reviewers, each reviewer writes a review, which benefits all other group members. Writing a review generates private costs and benefits others. Not writing a review and saving these costs is the dominant strategy if a reviewer cares only about her own monetary payoff. If all reviewers follow this strategy, no reviews are written. This constitutes a Nash equilibrium because no individual reviewer can increase her expected payoff by providing higher-quality reviews. Clearly, this equilibrium is inefficient because no information is shared. To realize the benefits of customer-written reviews, retailers need to motivate their customers to write as many helpful reviews as possible.
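To make the dilemma concrete, the following minimal sketch (not part of the experiment; all parameter values are illustrative) computes payoffs in a standard linear PGG, in which free-riding is individually optimal but full contribution is socially optimal:

```python
# Illustrative linear public good game: n players each contribute c_i from
# an endowment; the pot is multiplied by m (1 < m < n) and shared equally,
# so the marginal per capita return m/n is below 1 and contributing
# nothing is the dominant strategy.

def pgg_payoffs(contributions, endowment=20.0, multiplier=1.6):
    n = len(contributions)
    share = multiplier * sum(contributions) / n
    return [endowment - c + share for c in contributions]

print(pgg_payoffs([0, 0, 0, 0, 0]))       # all free-ride: 20.0 each
print(pgg_payoffs([20, 20, 20, 20, 20]))  # all contribute: 32.0 each
print(pgg_payoffs([0, 20, 20, 20, 20]))   # lone free rider earns 45.6
```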

Approval and Disapproval as Nonmonetary Rewards
The general pattern in laboratory PGGs is that, on average, participants contribute approximately 50% of their endowment and contributions decrease over time (e.g., Zelmer 2003). Contributions can be increased by adding a second stage in which participants can reward or punish each other. Rewards and punishments can be monetary (e.g., Fehr and Gächter 2000, Masclet et al. 2003, Sefton et al. 2007) or nonmonetary (Masclet et al. 2003, Dugar 2013, Greiff and Paetzel 2015). Nonmonetary rewards and punishments are usually modelled as the expression of approval and disapproval points. In the literature, this is often referred to as the exchange of social approval, peer approval, or the expression of informal sanctions. In the following, we will refer to it as approval or disapproval and argue that a review's helpfulness rating can have a similar effect. From a theoretical perspective, approval and disapproval can affect contributions if they are arguments in a player's utility function. Masclet et al. (2003) show that in a PGG with homogeneous endowments, approval increases average contributions. Greiff and Paetzel (2015) show that the same holds in a PGG with heterogeneous endowments.
A second theoretical argument for the effectiveness of approval and disapproval applies to repeated games. Approval and disapproval can serve as pre-play communication for future rounds (Masclet et al. 2003, p. 367). Although this form of pre-play communication is cheap talk, it is well known that cheap talk positively affects contributions (Ledyard 1995, Zelmer 2003). Taking this into account, one would expect the effect of approval to be stronger in repeated interactions, which is in line with Masclet et al. (2003). 7 In our experiment, participants can use helpfulness ratings to express approval or disapproval of the reviews written by other participants. A low rating can signal, for example, that a review is too short and that the contribution to the public good is insufficient. Based on the previous discussion, we expect to see a relation between helpfulness ratings in one round and the quality of reviews in the succeeding round. Monetary incentives for high-quality reviews can affect this relation because they can lead to crowding-out and strategic downvoting, which we discuss next.

Crowding-Out
Monetary incentives may increase contributions. On the one hand, there is a direct price effect: the incentivized behavior becomes more attractive, and being paid to write a review makes review-writing more attractive. However, there might also be negative effects (e.g., Gneezy et al. 2011).
On the other hand, there are psychological side effects, which may strengthen or weaken the individual's intrinsic motivation to engage in the incentivized activity. These effects are referred to as crowding-in and crowding-out. If the crowding-out effect exceeds the price effect, the overall effect will be negative, meaning that the monetary incentive makes the incentivized activity less attractive. 8

7 In addition, the experiment reported in Gächter and Fehr (1999) reveals that in repeated PGGs with partner matching, the effect of approval is strongest if group identity is established before the game. Dugar (2013) compares the effect of approval and disapproval in a PGG with partner matching. He finds that disapproval points have a larger effect than approval points, but that the effect on contributions is largest when participants are allowed to choose between approval and disapproval. There is also some evidence of antisocial punishment (see Herrmann et al. 2008), meaning that high contributors are punished.
Crowding-out occurs because the strength of intrinsic motivation depends on the perception of the activity, and introducing a monetary incentive affects this perception. The incentivized behavior becomes less attractive because it loses its signaling power as a voluntary task. Without the monetary incentive, high-quality reviews are likely to be driven by the desire to signal competence, generosity, or the reviewer's adherence to a non-market norm of review writing. When quality is incentivized, a high-quality review is more likely to signal the reviewer's self-interest, which can lead to reduced feelings of self-determination and competence. Ultimately, this can negatively affect intrinsic motivation and lead to crowding-out.

Strategic Downvoting
Whenever helpfulness ratings are assigned solely based on quality, the best review will receive the highest rating. At first glance, making rewards directly dependent on quality appears to be a straightforward and effective idea (Wang et al. 2012). It rests, however, on the assumption that helpfulness ratings are cast honestly.
However, if the best review is rewarded with a bonus, reviewers have an incentive for strategic downvoting: Assigning a lower rating to others increases the chance of getting the highest rating and receiving the bonus.
Strategic downvoting can reduce reviewers' motivation as well as the signaling power of helpfulness ratings. If helpfulness ratings are a form of approval, as we have argued above, reviewers might expend considerable time and effort on a review because this increases the chances that the review will receive a high helpfulness rating. Strategic downvoting weakens this channel because it lowers the correlation between helpfulness ratings and quality. 9 Reviewers will learn about or anticipate strategic downvoting, and helpfulness ratings will lose their motivating power. If reviewers anticipate strategic downvoting, the quality of reviews might deteriorate because of crowding-out effects. Hence, review quality might decrease.
A related issue is the effect of strategic downvoting on the signaling power of ratings. If strategic downvoting occurs, customers cannot identify the most helpful reviews by looking at the helpfulness ratings, which means that they may end up basing their decisions on mediocre reviews that were worse but not strategically downvoted.

Hypotheses
We focus on review systems in which the quality of reviews is endogenously determined by reviewers' helpfulness ratings. Our main goal is to investigate the effect of a quality-contingent bonus on the quality of reviews. Since reviewers' helpfulness ratings are central to our theoretical reasoning, we also investigate the effect of the bonus on reviewers' assignment of helpfulness ratings. Based on the theoretical considerations discussed above, we derive the following hypotheses, which we will test using data from a controlled laboratory experiment and an online survey. A critical discussion at the end of the paper will further examine the assumptions underlying our hypotheses.
Hypothesis 1 (H1 - Strategic Downvoting): Incentivizing review quality by introducing a quality-contingent bonus decreases the helpfulness ratings that reviewers assign to others' reviews.

The theoretical background for H1 is based on our discussion of strategic downvoting. If quality is incentivized, reviewers maximize their expected payoff by assigning the lowest helpfulness rating to all other reviews. If H1 holds and strategic downvoting occurs, we may expect:

Hypothesis 2 (H2 - Crowding-Out): Incentivizing review quality by introducing a quality-contingent bonus decreases the average quality of reviews.
In the extreme case, all reviewers assign the lowest helpfulness rating to each review. Consequently, helpfulness ratings are statistically independent of review quality, and the extrinsic motivation to write reviews is low. In addition, crowding-out may affect intrinsic motivation as well, due to the monetary nature of the incentive. Average review quality will then be lower when quality is incentivized, compared to a review system without incentives for quality, due to crowding-out and the loss in motivating power of helpfulness ratings.

Experimental Design
Each participant plays within a group of 5 players for 4 rounds. Each round consists of two stages (see Figure 1). Group composition remains constant across rounds.

BT (bonus treatment): the participant with the highest average helpfulness rating in her group receives a bonus (€5).

In the first stage of each round, participants receive a product and are given the opportunity to sample it. All participants receive the same product, and, in each round, they receive a new product. Products are shown in Figure 2.
Products are distributed at the beginning of each round and collected after each round. We use inexpensive everyday experience products in order to (i) minimize the effect of differences in participants' product(-type) expertise and (ii) make sure that participants were able to quickly familiarize themselves with the products. Products are always presented in the same order to avoid product-order-related confounding. We supplied participants with real products and designed the computer interface to resemble existing websites where customers review products. These choices were made so that participants could be expected to be familiar with the environment. The instructions were also framed to put participants in the shoes of real reviewers. After sampling the product (without a time limit), participants have to write a product review of 0-400 characters. In addition to the cognitive effort, writing the review is costly in terms of money. For each character written, a participant's payoff is reduced by €0.004. If a participant writes a review of maximum length (400 characters), €1.60 is deducted from her endowment. To avoid losses, participants are endowed with €1.60 per round.
Implementing a fixed cost per character written increases the opportunity cost of review writing. This feature of the design ensures that writing a review is costly, even if costs of real effort (which are unobservable) are close to zero. Moreover, it ensures that participants have no incentive to write uninformative reviews.
Providing reviews generates a benefit for others. The size of this positive externality depends on the quality of the review, which we proxy by the number of characters. Mudambi and Schuff (2010) have shown that review length has a positive effect on quality, because longer reviews include more detailed descriptions of the product. To account for the positive externality, a participant's payoff increases by €0.005 for each character written by another group member. Note that participants only profit from reviews written by others. Thus, participant i's payoff from stage 1 is given by

π_i = 1.60 − 0.004 · c_i + 0.005 · c_{−i},

where c_i denotes the number of characters written by participant i and c_{−i} denotes the total number of characters written by all other participants in the same group. This incentive scheme resembles a public goods dilemma. In the standard PGG, the public good is given by the sum of all participants' contributions multiplied by a positive constant (the so-called marginal per capita return). In our setting, a participant's own contribution does not increase her own payoff from the public good, i.e., it affects only the other participants' payoffs. Similar to a PGG, the social benefit of a contribution (€0.005 for each of the other participants) exceeds the private benefit.
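As a minimal sketch of this stage-1 payoff (the parameter values are the ones stated above; the function and variable names are ours):

```python
# Stage-1 payoff: €1.60 endowment, €0.004 cost per own character,
# €0.005 benefit per character written by the other group members.
ENDOWMENT = 1.60
COST_PER_CHAR = 0.004
BENEFIT_PER_CHAR = 0.005

def stage1_payoff(own_chars, others_chars):
    """own_chars: characters written by participant i (0-400);
    others_chars: character counts of the other four group members."""
    return (ENDOWMENT
            - COST_PER_CHAR * own_chars
            + BENEFIT_PER_CHAR * sum(others_chars))

# Writing nothing maximizes one's own monetary payoff:
print(stage1_payoff(0, [400] * 4))    # 1.60 + 8.00 = 9.60
print(stage1_payoff(400, [400] * 4))  # 9.60 - 1.60 = 8.00
```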
In addition to writing the review, we gathered information about participants' product-specific preferences. Participants had to evaluate the product (numerical product score from 1 = very good to 6 = very bad) and were asked to specify their willingness to pay (WTP). If the WTP exceeded the price of the product (which was unknown to participants), they had to buy the product at this price. 10 WTP and product score remained private information.
In the second stage, participants were presented with the reviews of all other participants in their group and asked to rate each review's helpfulness on a five-point scale (5 = very helpful, 1 = not helpful at all). 11 Participants rated the reviews with the help of a 5-point star rating. We used different scales and input formats for the product evaluation in stage 1 and the helpfulness rating in stage 2 to avoid confusion among participants. Thus, in our experiment, all participants act both as reviewers and as readers (we discuss this in section 4).
Using a between-subjects design, we compare behavior across two treatments; participants' payoff in stage 2 is treatment-specific. In our first treatment, each participant receives a flat salary of €1 (flat wage treatment, FWT). In the second treatment, the reviewer with the highest average helpfulness rating receives a bonus of €5 while all other group members receive no payment (bonus treatment, BT). In case several reviews attained identical helpfulness ratings, the bonus was split evenly between these reviewers. At the end of each round, each participant was informed about her payoff and the average helpfulness rating for her review.
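The stage-2 payoff rule, including the even split of the bonus in case of a tie, can be sketched as follows (function and variable names are ours):

```python
# Stage-2 payoffs: flat €1 per participant in FWT; in BT, a €5 bonus
# for the highest average helpfulness rating, split evenly among ties.
def stage2_payoffs(avg_ratings, treatment):
    if treatment == "FWT":
        return [1.0] * len(avg_ratings)
    best = max(avg_ratings)
    n_winners = avg_ratings.count(best)
    return [5.0 / n_winners if r == best else 0.0 for r in avg_ratings]

print(stage2_payoffs([3.25, 4.0, 2.5, 4.0, 1.75], "BT"))   # tie: 2.50 each winner
print(stage2_payoffs([3.25, 4.0, 2.5, 4.0, 1.75], "FWT"))  # flat €1 each
```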
The total payoff of each round is given by the sum of payoffs from each stage. That is, total payoff is composed of (i) the payoff from stage 1 (see the formula above) minus the price of the product if the participant bought the product, and (ii) the payoff from stage 2.
The Nash equilibrium of this game is to assign the lowest helpfulness rating to every review in BT, while participants are indifferent about this decision in FWT. In BT, all reviews would be rated with the lowest helpfulness rating, so that the bonus would be split equally among the group of five, yielding the same payoff as in FWT. Knowing that the length of the review does not influence the chance of getting the bonus would induce participants to write no review in the first stage. As the game is finite, backward induction translates this result from the last stage to all previous ones. In this case, participants would earn €1.60 from stage 1 and €1 from stage 2, yielding a total payoff of €2.60 per round (minus the expenditure for buying products).
After the fourth round, participants were asked to fill in a questionnaire on demographic variables, product reviewing experience, and review usage. The total payoff earned in the experiment is given by the sum of all rounds' payoffs.

Survey Design
According to H1, the helpfulness ratings expressed by participants in treatment BT are biased downward due to strategic downvoting. In order to gather unbiased helpfulness ratings, we complemented our experiment with a survey whose participants differed from those in the experiment.
In the survey, participants were asked to rate the helpfulness of reviews written in the experiment. More precisely, we randomly selected the reviews of 20 participants from the experiment (10 from each treatment). The 80 reviews written by these experiment participants (4 reviews written by each participant) were then rated by the survey participants. 12

Experimental Procedure
The experiment was conducted at the Passau University Experimental Laboratory (PAULA) using classEx (Giamattei and Lambsdorff 2016). Upon arrival, participants were randomly seated in the laboratory and given detailed experimental instructions. 13 A pre-test and several control questions ensured that participants understood the instructions correctly. 14 We conducted 6 sessions with 90 participants 15 in 18 groups (8 FWT; 10 BT). 74% of the participants were female, resembling the composition of students in Passau. 13% studied an economics-related major, and the average age was 23.3 years. The experiment lasted on average 91.24 minutes (sd = 16.19). The average payoff of €14.37 for around 90 minutes of work is slightly above an average student salary in Passau at that time. The minimum show-up fee was €2.
In the survey, we ensured that the 96 survey participants had not taken part in the experiment. Each participant rated up to 20 reviews. 16 Participants received a flat salary of XX. In sum, the survey resulted in 1781 helpfulness ratings: 420 for reviews from FWT and 1361 for reviews from BT. On average, each review from FWT (BT) was rated by about 10 (34) survey participants. We collected more helpfulness ratings for BT because we expected downvoting to be more prominent in this treatment.

Strategic Downvoting and Review Quality
The analysis in this section focuses on three variables: review length as a proxy for quality, the helpfulness ratings from the experiment, and the helpfulness ratings from the survey. Figure 3 summarizes the data. In BT, reviews are longer, but receive lower helpfulness ratings. The presumably unbiased helpfulness ratings from the survey do not indicate that BT reviews are less helpful. This is an indicator of strategic downvoting.
First, we compare helpfulness ratings in the experiment. We use data from the first round only because these observations are independent (40 obs. for FWT, 49 obs. for BT). In BT, the mean is lower (2.56 in comparison to 3.04 in FWT; two-sided Mann-Whitney test, z = 2.972, p = 0.0030).
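A sketch of this first-round comparison (the placeholder arrays and variable names are ours; the data are not reproduced here):

```python
# Two-sided Mann-Whitney U test on independent first-round observations.
from scipy.stats import mannwhitneyu

# Placeholder data: replace with the per-participant first-round average
# helpfulness ratings from FWT and BT (40 and 49 values, respectively).
ratings_fwt = [3.0, 3.5, 2.75, 3.25]
ratings_bt = [2.5, 2.25, 3.0, 2.0]

stat, p = mannwhitneyu(ratings_fwt, ratings_bt, alternative="two-sided")
print(stat, p)  # on the full data, the paper reports z = 2.972, p = 0.0030
```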
The comparison of these means, however, does not consider the quality of the reviews. Lower helpfulness ratings in BT might be justified if review quality were lower. This is unlikely because reviews are significantly longer in BT (216 characters in FWT and 302 in BT; two-sided Mann-Whitney test, z = -3.202, p = 0.0014). More importantly, judged by the unbiased ratings from the survey, average review quality is higher in BT (3.17 in FWT and 3.56 in BT; two-sided Mann-Whitney test, z = -2.156, p = 0.0311; here we have only 10 observations per treatment). Hence, we can exclude the possibility that lower helpfulness ratings in BT are driven by the quality of the reviews.

14 We conducted a pre-test with 10 participants each. Both treatments and the questionnaire were tested twice. Pre-test participants were asked to write down suggestions for improvement and requests for clarification of the experimental procedure during the experiment. We implemented the suggestions and found after the second pre-test that all participants had correctly understood the experimental tasks and had not experienced any problems in carrying them out. We also used the pre-tests to calibrate the parameter settings for the maximum number of characters per review, the "exchange rate" between characters and €, and the incentive payment. Participants were thus able to earn a reasonable hourly wage. Our results show that their behavior was not driven by the desire to minimize unpaid time but that they expended real effort on the experimental tasks. 15 One participant closed the browser after two rounds, could not continue the experiment, and was therefore excluded from the analysis. After the participant quit the experiment, their group continued with the remaining four group members. Thus, our analysis for BT is based on nine groups with 5 members and one group with 4 members. Participants in the reduced group only saw that the fifth participant always wrote empty reviews and therefore had the same information as participants in a group of five. 16 86 participants rated the maximum of 20 reviews. 10 participants quit early; one of them rated one review, seven rated five reviews each, one rated 10 reviews, and one rated 15 reviews. We asked participants to rate all reviews, but since the survey was run online, some quit early.
Result 1: There is strategic downvoting in treatment BT. We find clear evidence for Hypothesis H1.

Figure 4 illustrates Result 1 graphically by plotting the mean helpfulness rating each review received in the survey against the rating it received in the experiment. Points above (below) the 45-degree line indicate reviews for which the helpfulness rating in the survey was higher (lower) than the helpfulness rating in the experiment. In FWT, the points are distributed around the 45-degree line: 18 points lie above, 19 below, and 3 exactly on the line. This is very different in BT, where only 3 points lie below and 37 points lie above the 45-degree line.

To further validate Result 1, we run a linear regression 17 of the helpfulness rating on review length, an indicator variable for BT, and the interaction between both independent variables (see the first column in Table 1). Review length positively impacts the helpfulness rating. The coefficient for the BT indicator is negative and significant. The coefficient for the interaction term (BT * review length) is also negative and significant. Hence, an increase in review quality leads to an increase in the helpfulness rating, but the increase is smaller in BT. In the second regression in Table 1, we add controls for rounds (taking round 1 as the baseline). Note that the round indicators capture not only learning effects but also individual product characteristics, as participants rated a different product each round. Our results are robust to the inclusion of round effects.

17 Results are robust with respect to the model specification. Ordered-logit regressions or random-effects regressions yield qualitatively identical results. For the sake of simplicity, we only report the OLS estimates.

The evidence described in the previous paragraphs indicates that helpfulness ratings differ between treatments. Helpfulness ratings expressed by participants in BT are clearly biased downwards. Hence, they do not reflect the true underlying quality but are driven by strategic downvoting. 18 As we cannot reject H1 (strategic downvoting), we now analyze the evidence related to the quality of the reviews. The comparison of helpfulness ratings from the survey reveals that review quality is higher in BT (see above). Contrary to our expectations, the data does not provide support for H2.
Result 2: Despite strategic downvoting, the average quality of reviews is higher in treatment BT.
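To make the Table 1 specification concrete, here is a sketch of the two regressions, assuming a long-format pandas DataFrame `df` with one row per assigned rating and (our) column names `helpfulness`, `length` (in 100s of characters), `bt` (1 = bonus treatment), and `round`:

```python
import statsmodels.formula.api as smf

# Column 1 of Table 1: helpfulness on length, BT dummy, and interaction.
m1 = smf.ols("helpfulness ~ length + bt + length:bt", data=df).fit()

# Column 2 adds round dummies (round 1 as baseline).
m2 = smf.ols("helpfulness ~ length + bt + length:bt + C(round)", data=df).fit()

print(m1.summary())  # BT main effect and interaction expected negative
print(m2.summary())
```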
If the quality of reviews is not affected by strategic downvoting, the question remains which factors influence the quality of reviews. To shed some light on this question, we performed Tobit regressions with review length as the dependent variable (see Table 2).
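Since standard Python libraries offer no ready-made Tobit estimator, a two-limit Tobit (review length censored at 0 and 400 characters) can be sketched via maximum likelihood; `X` (including a constant column) and `y` are assumed to be numpy arrays:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def tobit_negloglik(params, X, y, lower=0.0, upper=400.0):
    # Last parameter is log(sigma), which keeps sigma positive.
    beta, sigma = params[:-1], np.exp(params[-1])
    xb = X @ beta
    ll = np.where(y <= lower, norm.logcdf((lower - xb) / sigma),   # left-censored
         np.where(y >= upper, norm.logsf((upper - xb) / sigma),    # right-censored
                  norm.logpdf((y - xb) / sigma) - np.log(sigma)))  # uncensored
    return -ll.sum()

def fit_tobit(X, y):
    start = np.zeros(X.shape[1] + 1)  # coefficients plus log(sigma)
    return minimize(tobit_negloglik, start, args=(X, y), method="BFGS")
```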
Higher helpfulness ratings in a given round increase review quality in the following round. This positive correlation indicates that helpfulness ratings may act as approval, similar to Masclet et al. (2003), Dugar (2013) and Greiff and Paetzel (2015).
A similar effect arises due to the dynamics within a group. If all other group members provide high-quality reviews, participants also increase the quality of their reviews. These self-reinforcing effects are similar to the coordinating effect of high contributions often observed in public good games (e.g., Weimann 1994).
We also examined the influence of winning the bonus on review behavior in the round following the win. Winning the bonus had a negative but non-significant effect on review quality.
We included controls for the product score and WTP. Only in FWT, the score has a weakly significant effect, indicating that in this treatment, participants who perceive the product as better tend to write longer reviews.
Controlling for gender and economics-related major showed that reviews written by female participants were significantly longer in FWT but not in BT. Participants with an economics-related major wrote shorter reviews in FWT. As we do not find a bonus effect on the length of reviews written, we take an additional look at behavior over time as a robustness check.

Result 3: Over time, review quality decreases in FWT but not in BT. Strategic downvoting becomes more severe over time.

Behavior Over Time
Similar to repeated public goods games, in FWT we observe that the number of characters written decreases over time. In BT, the length of reviews does not decrease but stays constant. It seems that the existence of the bonus prevents a decrease in review quality. Participants do not shy away from writing long, high-quality reviews despite their reviews being voted down strategically. Learning does not play a role here, as we find these long reviews until round 4 in BT. The pattern is mirrored if we look at the helpfulness ratings from the survey: in FWT, these ratings show a downward trend, while in BT they stay constant. In contrast, the helpfulness ratings in the experiment show a very different pattern. While in FWT they correlate with the length of the review, in BT the helpfulness ratings sharply decrease over time even though review length remains high and constant.

Table 3: Review writing behavior over time with review length as the dependent variable. Random effects regressions, standard errors in parentheses. * p < 0.10, ** p < 0.05, *** p < 0.01. In FWT, data from 40 participants, rounds 1-4. In BT (second column), data from 49 participants, rounds 1-4. In BT (last column), data from 48 participants, rounds 1-4 (one participant excluded because he/she did not indicate his/her gender).
In Table 3, we regress review length on rounds. Only in treatment FWT is the coefficient for the round negative and significant, indicating that review length decreases by about 30 characters per round. 20 The regressions support the results shown in Figure 5.
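The Table 3 specification can be sketched as a random-intercept model, assuming a DataFrame `panel` with (our) column names `length`, `round`, and a participant identifier `pid`, estimated separately per treatment:

```python
import statsmodels.formula.api as smf

# Review length on round, with a random intercept per participant.
re_model = smf.mixedlm("length ~ round", data=panel, groups=panel["pid"]).fit()
print(re_model.summary())  # in FWT, the round coefficient is about -30
```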

Conclusion
Both quantity and quality of customer-written product reviews have positive impacts on purchase intentions, sales, and customer satisfaction. Retailers and online platforms use different incentives to increase review quantity and quality. In our experiment, we focus on review quality and compare two different incentive schemes. Under the first incentive scheme, reviewers receive a flat salary per review, independent of the review's quality. Under the second incentive scheme, only the reviewer who wrote the highest-quality review receives a bonus.
Under both incentive schemes, review quality is determined through helpfulness ratings, which are assigned by the other reviewers. Theory predicts that the bonus will lead to strategic downvoting. Reviewers will assign low helpfulness ratings to reviews written by others, because this maximizes the chances of winning the bonus. If reviewers anticipate strategic downvoting, the quality of reviews might deteriorate because of crowding-out effects.
Our data shows that the quality-contingent bonus indeed leads to strategic downvoting.
Although the data provides clear evidence for strategic downvoting, the bonus does not have a negative effect on review quality. Review quality remains constant in the presence of the bonus scheme but decreases over time when reviewers receive a flat salary.
The existence of strategic downvoting could be problematic because the biased helpfulness ratings lead to a loss in signaling power. If customers realize that they cannot rely on helpfulness ratings to help them find high-quality reviews, their search and evaluation costs will increase. If they do not realize it, they base their decisions on inferior reviews and may regret their purchase decisions.
Retailers can counteract the loss in signaling power because only reviewers have an incentive for strategic downvoting. "Voters", who never write reviews, have no incentive to do so. By focusing on helpfulness ratings assigned by these customers, retailers can identify the reviews that are most helpful (based on unbiased ratings).
With respect to the literature on public goods, our study indicates that a monetary bonus given to the "best" contributor increases efficiency, even though the "best" contributor is determined endogenously, which could give rise to strategic downvoting. However, caution should be exercised when generalizing the results from our study to other public-good-style situations. We have analyzed a market for reviews in which each and every review receives exactly the same number of helpfulness ratings from all other participants. Moreover, there are no opportunity costs of assigning helpfulness ratings. Because of these differences, our study might not adequately capture many features of "real-world" public-good-style situations. It would be worthwhile to analyze how the presence of opportunity costs affects the assignment of helpfulness ratings and, consequently, contributions. A further limitation of this study is the short-run examination of review behavior. It remains unclear whether a bonus still has no negative effect on review quality when review behavior is observed over an extended period of time. These aspects are beyond the scope of our paper, which had the more modest goal of identifying whether a monetary bonus affects the assignment of helpfulness ratings and participants' review-writing behavior.
The results derived from this study open up new and interesting questions for future research on how different incentive schemes affect review quality. Future research could use more complex experimental designs to analyze the changes discussed in the previous paragraph. In addition, field data could be used to shed light on the size of the "downvoting" bias in existing review systems and to develop and test alternative helpfulness-based incentives that do not give rise to strategic downvoting.

Oral instructions (read aloud before the experiment, in German)
Welcome to the experiment and thank you very much for your participation. I will briefly read you some general explanations about the experiment. Please do not click on "Start experiment" until the end of these instructions. The participants of the experiment are all here in this room and are all taking part in the same experiment. The experiment aims at gaining insights on human behavior. The experiment lasts about 90 minutes and on average you will receive between 7 and 15 Euros, depending on your behavior, but at least 2 Euros. You will play anonymously and can't coordinate with each other. The disbursement of payoffs will also be carried out anonymously. No other participant will see how much you receive and the experimenters will not find this out either. During the experiment you may have to wait for the other participants. This may take a few minutes. Please remain patient during this time. When everyone has finished the experiment, you will be asked to go outside one after the other. There you will receive your payment. All instructions and explanations can be found on the following screen pages. Please read all the information carefully before leaving a screen by mouse click. Once you leave a screen, you will not be able to access it again. Never use the back function of the browser and do not surf on other websites during the experiment. We will log the accessed pages during the experiment. Never close the browser! A violation of these rules will exclude you from any payoff. Please remain calmly seated at your workplace. Please refrain from any conversations. If you have any questions, please raise your hand. We will then come to you. Now click on "Start experiment".

Instructions -Welcome to the experiment!
Thank you very much for your participation! At the beginning of the experiment the general laboratory procedures will be explained. These will be read aloud by the experimenter. Please click on "Start experiment" as soon as you are asked to do so. Never use the back-function of the browser and do not surf on other websites during the experiment. We log the accessed pages during the experiment. Never close the browser! A violation of these rules will exclude you from any payoff.

Instructions -General information
In this experiment your task is to write product reviews in an online shop and to evaluate the reviews of 4 other reviewers. You always interact with the same 4 reviewers over the 4 rounds of the experiment. Each reviewer has identical tasks and receives the same instructions. At the beginning of the experiment you will receive €6.40 as initial endowment. In each round, earnings or deductions will be applied to your payoff account depending on your behavior. The balance of your payoff account at the end of the experiment will be paid to you. In the experiment, characters are converted into Euros: 25 characters = 10 Eurocents.

Instructions -Phase 1: Writing the review
Each round consists of three phases, which are explained to you one after the other.

Review -Rating -Round result
At the beginning of each round you will receive a product to which you should write a short review. The review should describe the product and make it easier for other users to make a purchase decision.
You will also be asked to give the product an overall grade (school grade 1-6) and indicate how much you are willing to pay to buy the product. This information will not be shared with other reviewers.

The effort
Of course, writing a review involves a lot of effort. The more characters, the higher the effort. The characters (including spaces and line breaks) you use for your review will be deducted from your payoff account.
Example: You write 0 characters => deduction: €0. You write 400 characters (maximum) => deduction: €1.60.

The benefit
Writing a review provides a benefit for all other reviewers. The benefit lies in the fact that a large number of reviews facilitates the purchase decision and reduces product uncertainty when buying online.
This means that all other reviewers (except you) each receive half of their characters added to their account balance.
In the same way, half of all characters written by the other reviewers will be credited to your account. So you benefit if the other reviewers write a lot.
Reviews without content (e.g. only spaces, 100 times the letter x) or with meaningless content will not be credited.
Example: You write 400 characters (your deduction: €1.60) => credit of 200 characters (= €0.80) for each other reviewer => credit for all others together: €3.20.

Instructions -Phase 2: Rating reviews of others
After writing your own review you will be asked to evaluate the reviews of the 4 other participants regarding their helpfulness.

Helpfulness Assessment
Helpfulness is the usefulness of the review for a possible purchase decision. You can rate helpfulness on a scale of 5 from unhelpful (1 star) to very helpful (5 stars).
Please note that all reviews must be rated (at least 1 star = not helpful at all).

ONLY IN TREATMENT BT
Award for the review with the best helpfulness rating.
The review that receives the best average helpfulness rating from the other 4 reviewers gets a prize of €5 in addition to its normal payoff.
If several reviews have the same average rating, the prize will be split.

ONLY IN TREATMENT FT
The rating has no effect on the payoffs.

Instructions -Phase 3: Result of the round
At the end of each round, you will be informed about your total payoff for that round and the average helpfulness rating for your own review.
ONLY IN TREATMENT 1 You will also receive €1 for evaluating the product (overall impression, willingness to pay) and rating the other reviews.

Your payoff at the end of the round
You have written a review with X characters. These characters will be deducted from your account balance. The other 4 reviewers wrote reviews with a total of Y characters. Since you also benefit from the reviews as a user, you will receive half of the characters credited as a payout. ONLY IN TREATMENT 1 You will additionally receive 1 € credited for evaluating the product and rating the other reviews.

ONLY FOR THE BEST REVIEW IN TREATMENT 2
You have the best helpfulness rating and get an additional payoff of € 5.

Purchase of test products
In the first phase you indicated how much you are willing to pay for the product. If this willingness to pay is higher than the purchase price of the product (purchase on the Internet), you will receive the product and the purchase price will be deducted from your payoff account. Please note that you pay only the purchase price, not your stated willingness to pay. Important: The indication of the willingness to pay is binding.

Example
If you are willing to pay a high price, you can enter it as willingness to pay, but you will then receive the product at the cheaper purchase price.

Payment and questionnaire
At the end you will find an anonymous questionnaire about your experiences with online review systems.
When you leave the laboratory, you will receive your payment and, if applicable, the products you have purchased (in their original packaging).

Decision screens
All decision screens show a bar with (1) the round number, (2) the stage in the current round (review, rating, result of the round), (3) the total payoff (excluding the current round) and (4) the conversion rate of characters into Euros (25 characters = 10 Eurocents).

Decision Screen 1 -Writing review
On the first screen participants can write their review. The actual on-screen text is marked by quotation marks; the remaining text is explanation.
1) General instructions for writing the review: "The product is now distributed. It will be collected after the round. Please write a review about the product. You have a maximum of 400 characters available. Each character that you write will be deducted from your payoff account. By writing the review, you create utility for the other reviewers. They each get half of your characters as payoff." 2) In the text field, subjects could write their review and were informed about the remaining characters.
3) In the info field, subjects could see live how writing affected their payoff. The text states: "Your payoff account (after deducting the characters from your review)".

4) In a second info field, subjects could see live how writing affected the others' payoffs. The text states: "Euros that will be given to each of the 4 other reviewers."

5) This message informs subjects that "The following inputs are not shown to the other reviewers" (referring to 6 and 7). 6) Subjects have to rate the product: "Please rate your overall impression of the product with a school grade (1-6)". In Germany, 1 is the best and 6 the worst grade; a 4 or better is a pass. 7) Subjects have to state their willingness to pay for the product: "How many € are you ready to pay to buy the product? (willingness to pay)".
8) The message at the bottom informs subjects about the binding nature of the input in 7. "If your willingness to pay is above the buying price of the product, you get the product at the end of the experiment for the buying price. This input is binding." 9) "Submit inputs"

Decision screen 2 -Rating reviews
Then subjects rate the reviews of the 4 other reviewers.
1) The instructions state: "Please rate the helpfulness of the 4 other reviews with a star rating. (1 = not helpful at all, 2 = little helpful, 3 = average helpful, 4 = helpful, 5 = very helpful). The reviewer with the highest average rating in your group of 5 gets an additional 5 Euros. With a tie, the award will be split." (in the BT treatment). In FT, the last two sentences were replaced by "The rating does not influence your payoff." The €5 bill was only shown in the BT treatment.
2) Each review is headed by stating "Participant A has written the following review:" 3) The review was reproduced. The number of characters was added in brackets "(This review has 347 characters)". 4) "Please rate the helpfulness of the review".

Decision Screen 3 -Feedback
The third screen provides feedback.
1) In the BT treatment, subjects were informed whether they had the highest rating: "You have the highest helpfulness rating and get €5." If not, they were informed accordingly.
In the FT treatment, this part was omitted.
2) Subjects were informed about the average helpfulness rating they obtained: "Your review got an average helpfulness rating of 1.25 by the 4 other reviewers".
3) Feedback on payoffs was provided: "In this round, you received the following payoff. You wrote a review with 208 characters. These characters were deducted from your payoff account. The other reviewers wrote reviews with 359 characters in total. As a user, you profit from these reviews and you get half of all characters as payoff. You had the highest helpfulness rating and got an additional payoff of €5." The last sentence was changed accordingly in treatment FT to "You will additionally receive 1 € for evaluating the product and rating the other reviews." 4) Feedback on total payoff. "Overall, in this round your payoff has increased by €4.886".
5) Button to start the next round with warning "Please click only on the button once you read all instructions".

Questionnaire after experiment
Thank you for participating in the experiment! The following questionnaire will collect your experiences with online product reviews in the "real world". Please fill in the following questionnaire carefully. Answering the questions takes about 10-15 minutes.
The data collected cannot be personally attributed to you. This questionnaire has 29 questions.

Experience as reviewer -Part I
The following questions ask about your experience when writing online product reviews. (1 = absolutely true, 2 = true, 3 = hardly true, 4 = not true) -In return, I receive monetary incentives such as money or coupons.
-I will receive free test products in return.
-My experience helps other customers with the assessment of product quality.
-I enjoy writing reviews.
-I like to be in contact with other reviewers and readers.
-A reputation as a good reviewer is important for me.
-I want to support other customers to buy the right products.
-I like to exchange views with people who have similar interests.
-I hope that other reviewers and readers will give me advice on problems with products. Question only asked if a product review was written (question 1). Please select the appropriate answer for each item:

Experience as reviewer -Part II
(1 = absolutely true, 2 = true, 3 = hardly true, 4 = not true) -I want to help other people with my positive experience.
-I want to give other people the opportunity to buy the right product.
-So I can express the joy of a good purchase.
-I like to tell other people about a successful purchase.
-So I can show other people that I have bought cleverly.
-I would like to recommend the company.
-I like to support good companies.
10. What are the reasons for you to write a negative product review? Question only asked if a product review was written (question 1). Please select the appropriate answer for each item: (1 = absolutely true, 2 = true, 3 = hardly true, 4 = not true) -This is how I better process the frustration over a bad buy.
-So my anger over a bad buy is reduced faster.
-I want to warn other customers about bad products.
-I want to spare other customers a bad product experience.
-I want to pay back the manufacturer of the bad product.
-It is less time-consuming than complaining to the manufacturer by phone or e-mail.
-I believe that the manufacturer will solve the problems of their product faster if I discuss them publicly.
-The provider of the review platform will forward my complaint to the right place at the manufacturer.
11. How often do you shop on Amazon?
-More than 30 products a year -Between 30 and 10 products a year -Less than 10 products a year -Never
12. How often do you shop on the Internet?
-More than 30 products a year -Between 30 and 10 products a year -Less than 10 products a year -Never
13. How often do you read product reviews on Amazon before you make a purchase of a product?
-Always -Often -Sometimes -Rarely -Never
14. How often do you read product reviews on other online platforms before you make a purchase of a product? Question only asked if customer never reads on Amazon (question 13).

Question only asked if reviews are read (question 13).
Please select the appropriate answer for each item: (1 = always, 2 = often, 3 = sometimes, 4 = rarely, 5 = never)
-is displayed first automatically
-was rated as most helpful by other customers
-is the most recent
-evaluates the product best
-evaluates the product worst
16. Before I decide to purchase a product, I read all the available product reviews (on Amazon).

Reading Motivation Customers
17. I read the product reviews of other customers, ...

Question only asked if reviews read (question 13).
Please select the appropriate answer for each item: (1 = absolutely true, 2 = true, 3 = hardly true, 4 = not true)
-because they help me to make the right purchase decision.
-because it saves a lot of time if I want to inform myself about a product before buying it.
-in order to find advice and solutions for problems.
-because I feel better when I read that other people have the same problem with a product.
-to benefit from the experiences of others before I buy a product.
-because it is the fastest way to get information about a product.
-because I get to know about recent trends.
-to find confirmation that I have bought the right product.
-to compare my product evaluation with that of other people.
-because I like to share experiences with other reviewers.
-because I am rewarded for reading and rating (e.g. vouchers, free test products).
-to find the right answers when I have problems with the product.
-because I like to be part of the review community.
-because I am interested in new products.
-to find out if I am the only one with a certain opinion about a product.

18. How often have you rated the reviews of other customers?
Question only asked if reviews read (question 13).
-Never -Up to 10 times -Up to 20 times -Up to 100 times -More often

Customer View Helpfulness
19. I find a customer review particularly helpful if it...

Question only asked if reviews read (question 13).
Please select the appropriate answer for each item: (1 = absolutely true, 2 = true, 3 = hardly true, 4 = not true) -Discusses the disadvantages of the product.
-Discusses the advantages of the product.
-Is easy to read.
-Contains much expert information.
-Discusses both advantages and disadvantages of the product.
-Is very short.
-Extensively discusses the experiences of the reviewer with the product.
Customer Opinion Formation
20. I find reviews that are already rated as "helpful" by many other customers... Question only asked if reviews read (question 13).

29. How old are you?

Appendix B -Additional Results
What are the factors that drive the assignment of helpfulness ratings? To answer this question, we performed an OLS regression for each treatment, with the helpfulness rating as the dependent variable (see Table 4). 22 The independent variables fall into four categories.
The first category refers to review quality and includes only review length, our proxy for quality. In FWT, review length is positively and strongly correlated with helpfulness: helpfulness increases by 0.843 units for 100 additional characters; in BT, the increase is only 0.157 (significantly lower than 0.843; two-sided t-test, p < 0.001).

Table 4: OLS regressions with the helpfulness ratings assigned to others as the dependent variable; "review length other" is the number of characters (in 100s) of the review written by the participant to whom the helpfulness rating is assigned; "review length own" is the number of characters (in 100s) of the review written by the participant who assigns the helpfulness rating. Standard errors clustered by participant in parentheses. * p < 0.10, ** p < 0.05, *** p < 0.01. In FWT, data from 40 participants, rounds 2-4, 4 ratings assigned per participant and round. In BT, data from 48 participants, rounds 2-4, 4 ratings assigned per participant and round. For FWT we have 480 observations: 40 participants (8 groups with 5 participants per group), 3 rounds (rounds 2 to 4), and 4 ratings per participant and round. For BT, we have 48 participants (10 groups with 5 participants per group, minus 2 participants who had to be excluded), 3 rounds (rounds 2 to 4), and 4 ratings per participant and round. One participant was excluded because they closed the browser during the experiment, and another was excluded from this analysis because of missing data for gender.
The second category contains all variables that reflect strategic considerations. Due to strategic considerations, the length of a participant's own review could be correlated with the helpfulness ratings she assigns. Consider a participant who wrote a short review while her group members wrote long reviews. In order to win the reward, she has to obtain the highest helpfulness rating in her group. If she expects longer reviews to receive higher helpfulness ratings, she expects the other group members' helpfulness ratings to be higher than her own. In such a situation, the only thing she can do to maximize her chances of winning the reward is to assign lower ratings to others. Hence, we expect an inverse relationship between the length of a participant's own review and the incentive to downvote others: the shorter the own review, the greater the incentive to downvote other participants. Contrary to these considerations, we find that the length of a participant's own review has no effect on the helpfulness ratings she assigns.
The helpfulness rating received in the previous round could have a positive effect. Such a pattern would be consistent with indirect reciprocity, where participants who have received a high rating in one round (a form of approval) are more likely to assign a high rating in the next round. Our data does not support this pattern. In both treatments, the previous round's helpfulness rating has no effect.
The last strategic factor is the event of winning the bonus. In BT, winning the bonus has no significant effect on helpfulness ratings.
The third category of variables pertains to product evaluation. Differences between participants in their perception of a review's helpfulness could be driven by differences in product evaluation. Participants may consider reviews that voice opinions different from their own less helpful for a purchase decision. We control for this effect with the variables "product score" and "WTP" (both standardized as z-scores). To account for the possibility that only large differences matter and that the direction of the deviation may not, we include squared differences in scores and WTP. Only in FWT does the difference in scores influence the assignment of helpfulness ratings in the expected direction. In BT, different product perceptions do not influence the helpfulness ratings.
The fourth category contains a set of control variables for gender and field of study. None of the control variables had a significant effect. In both regressions in Table 4, we control for group and round effects. 23 Summarizing these results, we can say that the assignment of helpfulness ratings is driven by review length; all other factors had only small or non-significant effects. The helpfulness ratings thus proxy for review length and are not biased by strategic considerations or differences in product evaluations. This implies that the differences in correlations between treatments reported above cannot be explained by strategic considerations or differences in product evaluations.
Comparing the values for R², we see that the independent variables explain a much larger share of the variance in helpfulness ratings in FWT than in BT. This is in line with the presence of strategic downvoting.