Reputation as a sufficient condition for data quality on Amazon Mechanical Turk

Peer, Eyal; Vosgerau, Joachim; Acquisti, Alessandro

doi:10.3758/s13428-013-0434-y

Published: 20 December 2013

Volume 46, pages 1023–1031, (2014)

Behavior Research Methods

Abstract

Data quality is one of the major concerns of using crowdsourcing websites such as Amazon Mechanical Turk (MTurk) to recruit participants for online behavioral studies. We compared two methods for ensuring data quality on MTurk: attention check questions (ACQs) and restricting participation to MTurk workers with high reputation (above 95% approval ratings). In Experiment 1, we found that high-reputation workers rarely failed ACQs and provided higher-quality data than did low-reputation workers; ACQs improved data quality only for low-reputation workers, and only in some cases. Experiment 2 corroborated these findings and also showed that more productive high-reputation workers produce the highest-quality data. We concluded that sampling high-reputation workers can ensure high-quality data without having to resort to using ACQs, which may lead to selection bias if participants who fail ACQs are excluded post-hoc.

An increasing number of social scientists are capitalizing on the growth of crowd-sourced participant pools such as the Amazon Mechanical Turk (MTurk). One of the main issues that has been occupying researchers using this pool of participants is data quality (e.g., Goodman, Cryder, & Cheema, 2013). Recent studies have shown that various forms of attention check questions (ACQs) to screen out inattentive respondents or to increase the attention of respondents are effective in increasing the quality of data collected on MTurk (e.g., Aust, Diedenhofen, Ullrich, & Musch, 2013; Buhrmester, Kwang, & Gosling, 2011; Downs, Holbrook, Sheng, & Cranor, 2010; Oppenheimer, Meyvis, & Davidenko, 2009). Such ACQs usually include “trick” questions (e.g., “Have you ever had a fatal heart attack?”; Paolacci, Chandler, & Ipeirotis, 2010) or instructions that ask respondents to answer a question in a very specific way (e.g., to skip it or enter prescribed responses). The main objective of these ACQs is to filter out respondents who are not paying close attention to the experiment’s instructions. Additionally, including such ACQs in an experiment can help increase or ensure participants’ attention, since they do not know when to expect another trick question as the experiment progresses (Oppenheimer et al., 2009).

The use of ACQs can be particularly effective when researchers have no prior knowledge about participants’ motivation and capacity to read, understand, and comply with research instructions. MTurk, however, offers researchers information about the participants’ past performance, or reputation, in the form of approval ratings. Every time that a participant (a.k.a., “worker”) on MTurk completes a task (a.k.a., “Human Intelligence Task,” or “HIT”), the provider (a.k.a., “requester”) of that task can approve or reject a worker’s submission. Rejecting a worker’s submission also involves denying that worker her or his payment for completing the HIT and reflects badly on that worker’s account. Furthermore, it can reduce the variety of HITs that a worker can work on in the future, because requesters can demand that a worker have a minimum number of previously approved HITs to be eligible for their HIT. Although MTurk does not disclose individual workers’ approval ratings to requesters, it allows requesters to set a minimum qualification for workers to view and complete a HIT (e.g., that 95% of their previous HITs were approved). The main objective of setting this kind of qualification is to try to ensure that the responses collected for the study will be reliable and credible, and will allow the research to reach its objectives.
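As an illustration of how such a qualification might be expressed today, the following minimal sketch builds the requirement list that a requester could attach to a HIT. It is not taken from the study. The helper name `reputation_requirements` is hypothetical, the qualification type IDs are the system IDs as listed in the MTurk API documentation (an assumption that should be verified before use), and the comparator is set to "greater than or equal to" purely for illustration.

```python
# System qualification type IDs as listed in the MTurk API documentation
# (assumption: verify against the current documentation before relying on them).
PERCENT_APPROVED = "000000000000000000L0"  # Worker_PercentAssignmentsApproved
NUMBER_APPROVED = "00000000000000000040"   # Worker_NumberHITsApproved
LOCALE = "00000000000000000071"            # Worker_Locale


def reputation_requirements(min_percent=95, min_hits=None, country="US"):
    """Build a qualification list restricting a HIT to high-reputation workers.

    Hypothetical helper: it only assembles plain dictionaries and does not
    contact MTurk.
    """
    requirements = [
        {
            "QualificationTypeId": PERCENT_APPROVED,
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [min_percent],
        },
        {
            "QualificationTypeId": LOCALE,
            "Comparator": "EqualTo",
            "LocaleValues": [{"Country": country}],
        },
    ]
    if min_hits is not None:
        # Optional minimum number of previously approved HITs.
        requirements.append(
            {
                "QualificationTypeId": NUMBER_APPROVED,
                "Comparator": "GreaterThanOrEqualTo",
                "IntegerValues": [min_hits],
            }
        )
    return requirements


high_reputation = reputation_requirements(min_percent=95)
```

A list of this shape would typically be passed as the `QualificationRequirements` argument when creating a HIT through an API client; whether that matches a given requester's setup is an assumption.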

In this article, we compare the effectiveness of these two methods for ensuring data quality on MTurk: restricting samples to MTurk workers with high reputation (e.g., 95 % or more of their previous HITs approved) versus using ACQs to screen out inattentive workers and/or to increase their attention. We compare the two methods in terms of the validity, reliability, and replicability of the research findings they yield.

Attention checks versus approval ratings

Having participants pass ACQs or sampling those who have a high reputation could both improve data quality but may also bear unintended consequences. Restricting participation to MTurk workers with high reputation reduces the size of the population from which a sample is drawn, thereby potentially prolonging the time needed to reach a required sample size. Furthermore, sampling bias may result if workers with high reputation differ from those with low approval ratings on dimensions other than attention and willingness to comply with experimental instructions.

Using ACQs to screen out inattentive respondents, on the other hand, diminishes sample size and can lead to unequal experimental cell sizes and selection bias if responses are excluded after data collection is completed (Oppenheimer et al., 2009). Furthermore, ACQs might backfire. For example, ACQs such as “Have you ever, while watching TV, had a fatal heart attack?”—to which an attentive respondent must respond with “never” (Paolacci et al., 2010)—may cause reactance on the respondents’ part. An attentive respondent might take offense at the surveyor’s implicit assumption that he or she does not pay enough attention, and react by being less thorough in subsequent responding or by providing outright wrong answers. Although other ACQs can be less offensive (e.g., researchers can explain, in the ACQ, why it is important for them to make sure that participants are reading the instructions), adding an unrelated question (such as ACQs) can potentially disrupt the natural flow of a study. If ACQs are necessary to obtain high-quality data, then a relatively small disruption in the study’s flow is probably negligible. However, if ACQs do not improve data quality (or do so only for certain groups of MTurk workers—such as those with low reputation), then the use of ACQs should probably be discouraged, to avoid potential reactance and selection bias.

To compare the effectiveness of both methods, we ran two experiments on MTurk in which we orthogonally varied MTurk workers’ reputations (below vs. above 95 %) and the use of ACQs in the study (mandatory vs. absent). We assessed data quality in terms of reliability, validity, and replicability. For reliability, we asked participants to fill out several validated scales measuring individual differences (in personality, self-esteem, need for cognition, and social desirability). We used the social desirability scale also to assess data quality in terms of validity—assuming that more socially desirable responses are less valid. Finally, following Paolacci et al. (2010), we assessed data quality in terms of the replicability of well-known effects.

In the first experiment, we focused on comparing high- versus low-reputation workers and manipulated the use of ACQs to assess the contribution of each method to increasing data quality. In the second experiment, we replicated the first experiment’s results using different (and less familiar) ACQs, and also examined differences between workers with different productivity levels (i.e., those who had completed fewer vs. more previous HITs on MTurk).

Experiment 1

Method

Sampling and participants

Over 10 days, we sampled U.S. respondents from two populations on MTurk: workers with above 95 % approval ratings (high reputation), and workers with below 95 % approval ratings (low reputation). The cutoff of 95 % was chosen because—as a default setting in MTurk—it is used by many researchers. The cutoff, however, is arbitrary, and higher or lower cutoffs can be used for distinguishing high- versus low-reputation workers. The responses of 694 workers, 458 with high reputation and 236 with low reputation, were obtained. A power analysis showed that with these sample sizes, effect sizes of d = 0.25 and above would be detected in about 90 % of the cases. To verify workers’ reputations, we asked them to report their approval ratings. Although 91.1 % of the high-reputation workers confirmed that they had a higher than 95 % approval rating, 36.0 % of the low-reputation workers claimed to have an approval rating of above 95 %, χ ²(5) = 263.3, p < .001. Rather than doubting the validity of MTurk’s qualification system, we believe that these participants—intentionally or not—misreported their approval ratings. No statistically significant differences in either gender [χ ²(3) = 2.04, p = .56] or age [F(3, 690) = 1.59, p = .19] were found across groups (see Table 1).
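The reported power figure can be roughly reproduced with a normal approximation to the two-sample t test. This is a sketch under stated assumptions, not the procedure used in the study (which is not described): it assumes a two-sided test at alpha = .05 comparing the two reputation groups, and the function name `two_sample_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(d, n1, n2, alpha=0.05):
    """Approximate power of a two-sided, two-sample test for effect size d.

    Uses the normal approximation: the test statistic is treated as normal
    with mean d / sqrt(1/n1 + 1/n2) and unit variance.
    """
    z = NormalDist()
    noncentrality = d / sqrt(1 / n1 + 1 / n2)
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Probability of landing in either rejection region.
    return z.cdf(noncentrality - z_crit) + z.cdf(-noncentrality - z_crit)


# Group sizes from Experiment 1: 458 high- and 236 low-reputation workers.
power_exp1 = two_sample_power(0.25, 458, 236)
print(f"approximate power: {power_exp1:.2f}")
```

Under these assumptions the approximation comes out a little under .90, in line with the "about 90 %" stated above.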

Table 1 Demographics by group in Experiment 1

Design

About 70 % of each sample (high- and low-reputation workers) were administered ACQs, and the remainder were not. ACQ conditions were oversampled because we wanted to compare the responses of those who failed to those who passed ACQs (see the samples’ sizes in Table 1).

Procedure

Participants were invited to complete a survey about personality. The survey started with demographic questions, followed by the Ten-Item Personality Inventory (TIPI; Gosling, Rentfrow, & Swann, 2003), Rosenberg’s 10-item Self-Esteem Scale (RSES; Rosenberg, 1979), the short, 18-item form of the Need for Cognition scale (NFC; Cacioppo, Petty, & Feng Kao, 1984), and the short, 10-item form of the Social Desirability Scale (SDS; Fischer & Fick, 1993). All measures used 5-point Likert scales with the endpoints strongly disagree (1) and strongly agree (5), except for the SDS, which used a binary scale with agree (1) and disagree (0). Participants were then asked to complete a classic anchoring task (Tversky & Kahneman, 1974): They first entered the last two digits of their phone number, then indicated whether they thought that the number of countries in Africa was larger or smaller than that number, and finally estimated the number of countries in Africa.

In the ACQ condition, three ACQs were included in different parts of the survey. The Instructional Manipulation Check (IMC; Oppenheimer et al., 2009) was inserted right after the demographic questions. Participants were asked “Which sports do you like?,” but hidden in a lengthy text were instructions to ignore the question and simply to click on “Next.” The second ACQ (after the NFC questionnaire, before the anchoring task) asked—among other, unobtrusive questions—“While watching TV, have you ever had a fatal heart attack?” (Paolacci et al., 2010). The last ACQ, at the end of the survey, asked participants “What was this survey about?,” preceded by instructions not to mark “Personality” but instead to choose “Other” and type “Psychology” in the text box (adapted from Downs et al., 2010). Participants were paid 50 cents.

Results

Attention check questions

We compared the rates of failing ACQs between high- and low-reputation workers. As can be seen in Table 2, only 2.6 % of high-reputation workers failed at least one ACQ, as compared to 33.9 % of the low-reputation workers [χ ²(1) = 89.46, p < .001]. For example, 0.4 % of the high-reputation workers indicated that they had had a fatal heart attack while watching TV, whereas 16.4 % of the low-reputation workers claimed to have suffered such a deadly incident. Given that almost all of the high-reputation workers (97.4 %) passed all ACQs, for the subsequent analyses we created five comparison groups: high-reputation workers who either received (and passed) ACQs or did not receive ACQs, and low-reputation workers who either passed all ACQs, failed ACQs at least once, or did not receive any ACQs. The sample sizes for these groups are given in Table 3.

Table 2 Proportions of participants who failed ACQs

Table 3 Data quality measures among high- and low-reputation workers

Reliability

We regarded internal consistency (Cronbach’s alpha) of established scales as evidence of data quality and compared it between the different groups of workers. First, we compared the reliability scores for the SDS, RSES, and NFC scales of the high- versus low-reputation workers (we did not examine the reliability of the TIPI scales because each scale only had two items). High-reputation workers produced higher reliability scores on all three scales (.635, .935, and .950 for the SDS, RSES, and NFC, respectively) than did low-reputation workers (.452, .887, and .865, respectively). Using Hakstian and Whalen’s (1976) test for statistical significance between independent reliability coefficients, we found that the differences in reliabilities between high- and low-reputation workers were statistically significant for all three scales, χ ²s(1) = 13.75, 19.86, and 62.60, ps < .001. Participants who had failed ACQs produced lower reliability scores on all three scales (.563, .821, .761, respectively) than did those who had passed (.601, .931, and .942) or had not received ACQs (.666, .923, and .932), χ ²s(1) > 13.75, ps < .001. When testing all possible pairwise comparisons among the five groups, using Bonferroni’s correction for post-hoc comparisons,^{Footnote 1} lower reliabilities were found among low-reputation workers who either failed or had not received ACQs, relative to high-reputation workers (whether they had or had not received ACQs; see Table 3).
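For readers who want to compute the internal-consistency measure themselves, the following minimal sketch implements the standard formula for Cronbach's alpha. The data are hypothetical, and the function name `cronbach_alpha` is ours. Reverse-keyed items are assumed to have been recoded beforehand.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a respondents-by-items table of scores.

    rows: list of respondents, each a list with one score per item.
    """
    k = len(rows[0])                    # number of items
    items = list(zip(*rows))            # transpose to one tuple per item
    item_variance_sum = sum(variance(col) for col in items)
    total_variance = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical responses of four participants to a three-item scale.
responses = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 4, 5],
    [1, 2, 1],
]
alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.3f}")
```

Computing alpha separately for each group of workers yields the coefficients that are then compared across groups.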

Social desirability

We regarded socially desirable responding as evidence of lower data quality. Comparing the five groups, we found statistically significant differences, F(4, 689) = 2.52, p = .04, η ² = .014. Low-reputation workers who had failed ACQs had the highest SDS scores, whereas high-reputation workers showed the lowest scores (p < .05; see Table 3). However, none of the pairwise comparisons was statistically significant (after Bonferroni correction; see Table 3).

Anchoring task

Following Paolacci et al. (2010) and Oppenheimer et al. (2009), we regarded the replicability of well-established effects as evidence for high-quality data. Numerous studies have shown that answering a hypothetical question about a clearly arbitrary anchor (e.g., the last two digits of one’s phone number) influences subsequent unrelated number estimates (e.g., Tversky & Kahneman, 1974). We expected high-reputation workers to be more likely to show the classic anchoring effect than low-reputation workers, because inattentive respondents are more likely to be distracted during the task, which should weaken an anchoring effect. The last two digits of phone numbers and the number of African countries showed the expected positive correlation―that is, evidence of an anchoring effect―among high-reputation workers (with and without ACQs) and among low-reputation workers who had passed the ACQs, but not among low-reputation workers who did not receive ACQs or had failed them (see Table 3). Bonferroni-corrected post-hoc comparisons showed that the differences between these correlations were statistically significant (ps < .05).
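The anchoring analysis amounts to one correlation per group and a comparison between correlations. The sketch below illustrates both steps. The data are hypothetical, and the Fisher z test is one common way to compare independent correlations; it is an assumption here, since the text does not name the test that was used.

```python
from math import atanh, sqrt
from statistics import NormalDist, mean


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def compare_correlations(r1, n1, r2, n2):
    """Two-sided Fisher z test for two independent correlations."""
    z = (atanh(r1) - atanh(r2)) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical data: last two phone digits (anchor) and country estimates.
anchors = [12, 35, 47, 68, 90]
estimates = [20, 30, 45, 50, 60]
r = pearson_r(anchors, estimates)
print(f"r = {r:.2f}")
```

A positive `r` within a group would count as evidence of anchoring; `compare_correlations` would then be applied to pairs of groups, with the resulting p values corrected for the number of comparisons.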

Central-tendency bias

To test whether workers differed in their tendencies to mark the midpoint of scales, regardless of the questions asked, we computed for each participant the relative frequency with which he or she had marked “3” on the 5-point scales in the TIPI, RSES, and NFC. An analysis of variance (ANOVA) on this central-tendency bias ratio showed significant differences between the groups, F(4, 689) = 12.76, p < .001, η ² = .07. As can be seen in Table 3, there was no difference in central-tendency bias between high-reputation workers who did or did not receive ACQs (p = 1.0). Among low-reputation workers, those who had passed ACQs showed a significantly greater central-tendency bias than did those who had failed ACQs (p = .006). The difference between low-reputation workers who had passed ACQs and those who did not receive ACQs was not statistically significant (p = .31; all p values are Bonferroni corrected).
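The central-tendency bias ratio described above is a simple proportion per participant. A minimal sketch follows, with hypothetical responses and a function name of our choosing:

```python
def midpoint_ratio(responses, midpoint=3):
    """Share of a participant's answers that fall on the scale midpoint."""
    return sum(1 for r in responses if r == midpoint) / len(responses)


# Hypothetical answers of one participant to eight 5-point items.
answers = [3, 3, 1, 5, 4, 3, 2, 3]
print(midpoint_ratio(answers))  # 4 of 8 answers are "3"
```

Because it is a proportion, the ratio always lies between 0 and 1, and the group means entered into the ANOVA are averages of these per-participant values.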

Discussion

The results of Experiment 1 suggest that workers’ reputation can predict data quality: High-reputation workers were found to provide higher-quality data than did low-reputation workers. High-reputation workers rarely failed ACQs (97.4 % answered correctly), and their responses resulted in higher reliability scores for established measures and showed lower rates of socially desirable responding. High-reputation workers also exhibited the classic anchoring effect, whereas low-reputation workers did not. Low-reputation workers, in contrast, were found to be more likely to cross off the midpoint of scales, regardless of the question asked (central-tendency bias).

ACQs did improve data quality, but for low-reputation workers only, and only in some of the cases. For the RSES and NFC scales, reliability scores among low-reputation workers who had passed ACQs were just as high as scores obtained from high-reputation workers who had either passed or not received ACQs. For the SDS scale, however, even low-reputation workers who had passed all ACQs produced a significantly lower reliability on that measure. Similarly, ACQs helped improve data quality among low-reputation workers in terms of replicability. Low-reputation workers who had passed ACQs showed the classic anchoring effect (as did high-reputation workers regardless of having received or not received ACQs), whereas low-reputation workers who had either failed or not received ACQs failed to produce the expected effect. Finally, low-reputation workers showed higher levels of central-tendency bias, independently of whether they had received, passed, or failed ACQs.

More importantly, though, ACQs did not seem to have any effect whatsoever on the data quality of high-reputation workers. The responses of high-reputation workers produced high scale reliabilities whether or not ACQs were used, showed the same (low) degree of socially desirable responding, exhibited almost identical effect sizes in the anchoring task, and displayed the same (relatively low) level of central-tendency bias. This lack of differences in all of the measures that we used strongly suggests that ACQs (or, at least, the ACQs used in this experiment) do not have an effect on high-reputation workers. Such a null effect, however, can only be meaningfully interpreted if the experiment was adequately powered to detect small effects (Greenwald, 1975). Experiment 1, with almost 700 participants in total, would have detected effects of d = 0.25 and above with a probability of 90 %. It is hence unlikely that differences among high-reputation workers who did or did not receive ACQs actually existed but were not observed in Experiment 1. Rather, the results suggest that ACQs are generally ineffective in improving data quality among high-reputation workers, who produce very high-quality data to begin with.

However, although the ACQs that we used in this experiment did not improve data quality, other ACQs might do so. The fact that almost all high-reputation workers passed the ACQs suggests that high-reputation workers might be familiar with these specific (and common) ACQs (the ACQs that we used in Exp. 1 have been available to researchers for several years now; e.g., the IMC was published in 2009). If familiarity was the cause for the high passing rate of ACQs, novel and unfamiliar ACQs might increase the data quality for high-reputation workers in the same way that they do for low-reputation workers. Experiment 2 was designed to test that possibility.

Experiment 2

Experiment 2 employed the same design and measures as Experiment 1, with two exceptions: First, we replaced the ACQs used in Experiment 1 with novel ACQs that we designed ourselves (after soliciting examples from colleagues). Second, in addition to workers’ reputation, we orthogonally manipulated workers’ productivity. Worker productivity refers to the number of HITs that a worker has previously completed on MTurk. Similar to worker reputation (the percentage of approved HITs), MTurk allows researchers to specify how many HITs workers must have previously completed in order to view and complete a proposed HIT. Workers’ productivity levels seem to vary greatly. For example, about half of the participants in Experiment 1 indicated that they had completed more than 250 HITs, and about 10 % said they had completed more than 5,000 HITs. A worker’s productivity―just like a worker’s reputation―may be a predictor of data quality, such that highly productive workers may be more likely to produce high-quality data than less productive workers. That could be the case because (a) highly productive workers are workers who are more intrinsically motivated to complete HITs to the satisfaction of the requester; (b) highly productive workers represent “good” workers that had stayed on MTurk, whereas “bad” workers would drop out over time; and (c) highly productive workers would be more experienced in answering survey questions, and thus produce higher-quality data.

Experiment 2 served three purposes: first, to test whether the findings of Experiment 1 would replicate; second, to test whether novel and unfamiliar ACQs would improve data quality for high-reputation workers; and third, to test whether worker productivity would have the same effect on data quality as worker reputation.

Method

Sampling

Over 10 days, we sampled MTurk workers (who had not taken part in Exp. 1) from the U.S. with either high or low reputation (above 95 % vs. less than 90 % previously approved HITs), and with either high or low productivity levels (more than 500 HITs vs. less than 100 HITs completed). Unlike in Experiment 1, in Experiment 2 we manipulated the factors of both reputation and productivity in such a way that a gap separated the manipulated levels. In this way, we avoided MTurk workers with similar reputation/productivity levels (e.g., 95.1 % vs. 94.9 % approved HITs, or 501 vs. 499 completed HITs) being categorized into different groups. As a consequence, it should be easier to detect actual differences in data quality as a function of worker reputation/productivity. The cutoffs for productivity were chosen on the basis of the distribution of self-reported productivity levels in Experiment 1 (about 25 % indicated that they had completed less than 100, and about 30 % said that they had completed more than 500 HITs).

Sampling was discontinued when an experimental cell had reached about 250 responses, or after 10 days. Although we were able to collect responses from 537 high-reputation workers in less than two days, we only obtained responses from 19 low-reputation workers in 10 days. After two days of very slow data collection, we tried to increase the response rate by reposting the HIT every 24 hours (so that it would be highly visible to these workers) and increasing the offered payment (from 70 to 100 cents for a 10-min survey). Unfortunately, both attempts were unsuccessful and we received only 30 responses from low reputation workers after 10 days. We thus decided to focus only on high-reputation workers and on the impact of productivity levels and ACQs on data quality. The obtained sample size allowed for detecting effect sizes of at least d = 0.25 with a power of about 80 %.

Participants

We collected responses from a total of 537 MTurk workers with high reputation (95 % or above): 268 with low productivity (100 or fewer previous HITs) and 269 with high productivity (500 or more previous HITs). The high- and low-productivity groups included similar ratios of males (61.5 % vs. 58.6 %), χ²(1) = 0.43, p = .51, but workers in the high-productivity group were somewhat older than those in the low-productivity group (M_high = 34.36, SD_high = 12.45 vs. M_low = 32.08, SD_low = 12.62), t(499) = 2.04, p = .04, d = 0.18. As expected, high-productivity workers reported having completed a much larger number of HITs than did low-productivity workers (M_high = 10,954.78, SD_high = 38,990.67 vs. M_low = 138.64, SD_low = 151.49), t(499) = 4.38, p < .001, d = 0.39. Interestingly, many workers from the low-productivity group (about 43 %) claimed to have completed more than 100 HITs, a fact that should have prohibited them from taking our HIT. Some (about 24 %) of these claimed to have completed more than 250 HITs, an overreport that can hardly be ascribed to simple oversight or memory error. Lastly, high-productivity workers reported having, on average, a slightly higher ratio of previously approved HITs than did low-productivity workers (M_high = 99.40, SD_high = 0.76 vs. M_low = 99.21, SD_low = 1.14), t(434) = 2.1, p = .04, d = 0.20.
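The effect sizes in this paragraph can be checked from the reported means and standard deviations. The sketch below assumes equal group sizes when pooling the standard deviations, which is only approximately true here (the groups had 268 and 269 workers, and the degrees of freedom suggest some missing responses). The function name is ours.

```python
from math import sqrt


def cohens_d(m1, sd1, m2, sd2):
    """Cohen's d from two means and SDs, pooling the SDs with equal weights."""
    pooled_sd = sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (m1 - m2) / pooled_sd


d_age = cohens_d(34.36, 12.45, 32.08, 12.62)
d_hits = cohens_d(10954.78, 38990.67, 138.64, 151.49)
d_approval = cohens_d(99.40, 0.76, 99.21, 1.14)
print(round(d_age, 2), round(d_hits, 2), round(d_approval, 2))
```

Under the equal-weight assumption, the three values round to the reported 0.18, 0.39, and 0.20.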

Design

The participants in each group were randomly assigned to either receive or not receive (novel) ACQs. As in Experiment 1, we oversampled the condition that included ACQs (in a ratio of about 67:33).

Procedure

As in Experiment 1, MTurk workers were invited to complete a survey about personality for 70 cents. Participants first completed the TIPI, followed by the 10-item version of the SDS, the 10-item version of the RSES, and the 18-item version of the NFC scale. On the last page of the survey, participants were asked to indicate their gender and age and to estimate approximately how many HITs they had completed in the past and how many of those were approved (in contrast to Exp. 1, these questions did not include predefined options, but used an open text box in which participants entered their responses, allowing us to collect more granular data).

Participants in the ACQ conditions were asked to answer three additional questions (three novel ACQs): The first one presented participants with a picture of an office in which six people were seated, and asked them to indicate how many people they saw in the picture. Hidden within a lengthy introduction were instructions to workers to not enter “6,” but instead to enter “7,” to show that they had indeed read the instructions. Any response other than 7 was coded as failing this ACQ. The second new ACQ was embedded in the middle of the NFC scale, in the form of a statement that read: “I am not reading the questions of this survey.” Any response other than “strongly disagree” was coded as failing this ACQ. The last novel ACQ consisted of two questions that asked participants to state whether they “would prefer to live in a warm city rather than a cold city” and whether they “would prefer to live in a city with many parks, even if the cost of living was higher.” Both questions were answered on 7-point Likert scales with the endpoints strongly disagree (1) and strongly agree (7). Participants were instructed, however, not to answer the question according to their actual preferences, but to mark “2” on the first question and then add 3 to that value and use the result (i.e., 5) as an answer to the second question.^{Footnote 2} Any deviating responses were coded as failing this ACQ.
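The coding rules for the three novel ACQs can be summarized in a few lines. This is a sketch of the rules as described above, not the scoring script used in the study. The argument names are hypothetical, and "strongly disagree" is assumed to be coded as 1.

```python
def passed_all_acqs(people_count, reading_item, warm_city, parks_city):
    """Return True only if all three novel ACQs were answered as instructed."""
    # ACQ 1: the instructed answer is 7, not the six people actually shown.
    picture_ok = people_count == 7
    # ACQ 2: "I am not reading the questions of this survey."
    # Only "strongly disagree" (assumed to be coded 1) counts as a pass.
    reading_ok = reading_item == 1
    # ACQ 3: mark 2 on the first question, then 2 + 3 = 5 on the second.
    cities_ok = warm_city == 2 and parks_city == 5
    return picture_ok and reading_ok and cities_ok


print(passed_all_acqs(7, 1, 2, 5))  # follows every instruction
print(passed_all_acqs(6, 1, 2, 5))  # reports the true head count instead
```

Participants for whom the function returns False would fall into the "failed at least one ACQ" group.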

Results

Attention check questions

As can be seen in Table 4, among those who received the ACQs (about 2/3 of each group), 80.3 % of the high-productivity workers passed all of them, relative to 70.9 % of the low-productivity workers, χ ²(3) = 12.63, p = .006. As in Experiment 1, we classified workers of each productivity group according to whether they had passed all ACQs, had failed at least one of the ACQs, or did not receive ACQs at all (see Table 5 for the groups’ sizes).

Table 4 Rates of passing/failing unfamiliar ACQs in Experiment 2

Table 5 Internal reliability and social desirability scores for the groups in Experiment 2

Reliability

As in Experiment 1, we regarded high internal reliability as evidence for high data quality. However, we could not (as we did in Exp. 1) compare reliabilities between high- and low-reputation workers, because we were unable to sample enough low-reputation workers. As an alternative, we decided to compare the reliability of the measures used in this study (SDS, RSES, and NFC) to their conventional coefficients, as reported in the literature: Fischer and Fick (1993) reported a reliability of .86 for the short form of the SDS with a sample of 309 students; Cacioppo et al. (1984) reported a reliability of .90 for the NFC scale that was obtained from 527 students; and Robins, Hendin, and Trzesniewski (2001) reported a reliability of .88 for the RSES among 508 students. We compared the reliabilities obtained from our MTurk groups to these scores using the Hakstian and Whalen (1976) test for significance of differences between independent reliability coefficients. In all analyses, we employed the Bonferroni correction method and multiplied the p values by the number of possible comparisons. We found that all groups showed a significantly lower reliability for the SDS than was reported in the literature, χ²(1) > 15.6, p < .01. However, the reliabilities for the RSES and NFC scales were not significantly lower than those reported in the literature (ps > .05). In fact, for some of the cases, reliabilities were higher than those reported in the literature, especially among high-productivity workers and those who had passed ACQs (see Table 5).
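The Bonferroni adjustment mentioned above is a one-line operation. The sketch below uses hypothetical p values and caps the adjusted values at 1, a common convention that we assume here.

```python
def bonferroni(p_values, n_comparisons=None):
    """Multiply each p value by the number of comparisons, capped at 1."""
    if n_comparisons is None:
        n_comparisons = len(p_values)
    return [min(1.0, p * n_comparisons) for p in p_values]


# Hypothetical uncorrected p values from three pairwise comparisons.
adjusted = bonferroni([0.01, 0.04, 0.5])
print(adjusted)
```

An adjusted value is then compared against the usual .05 threshold, so a raw p of .04 would no longer count as significant once three comparisons are taken into account.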

Comparing high- and low-productivity workers, we found that high-productivity workers produced higher reliability scores for the SDS, RSES, and NFC scales (.70, .931, and .951 vs. .576, .910, and .912, respectively). These differences were statistically significant for all three scales, χ²s(1) = 7.15, 4.23, and 21.27, ps = .0075, .039, and <.001, respectively, suggesting that high-productivity workers produced higher-quality data. When comparing the three groups who had passed, failed, or not received ACQs, we found no statistically significant differences in the reliability scores on the SDS, but we did find statistically significant differences in the RSES and NFC scales, χ²s(2) = 3.38, 18.84, and 7.61, p = .18, p < .001, and p = .022, respectively. On the two scales that showed statistical differences (RSES and NFC), participants who had passed ACQs showed higher reliability scores than did those who had failed or had not received ACQs (.938 vs. .897 and .888 for the RSES, and .946 vs. .917 and .927 for the NFC scale). However, the scores between those who had failed versus those who did not receive ACQs were not statistically different for either the RSES or the NFC scale, χ²s(1) = .18 and .42, ps = .67 and .51, respectively.

We then examined whether the effect of adding (novel) ACQs occurred within both high- and low-productivity workers. We compared the reliability scores of the three ACQ groups within each productivity group (which are given in Table 5). Among the low-productivity groups, we found no statistical difference for the SDS, but we did find significant differences for the RSES and the NFC scale, χ ²s(2) = 1.86, 7.66, and 6.74; ps = .39, .02, and .03, respectively. Among the high-productivity groups, we did not find statistical differences for the SDS or the NFC, but we did find significant differences for the RSES, χ ²s(2) = 4.27, 2.56, and 12.62; ps = .12, .28, and .002, respectively. This suggests that the aforementioned overall effect of ACQs was mostly driven by differences among the low-productivity workers.
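The p values attached to the chi-square statistics in this section can be reproduced with closed-form expressions, because the tests have only 1 or 2 degrees of freedom. The sketch below is an illustrative check rather than part of the original analysis. The function name `chi2_sf` is ours.

```python
from math import erfc, exp, sqrt


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square statistic for df = 1 or 2."""
    if df == 1:
        # A chi-square with 1 df is the square of a standard normal.
        return erfc(sqrt(x / 2))
    if df == 2:
        # A chi-square with 2 df is exponential with mean 2.
        return exp(-x / 2)
    raise ValueError("this sketch only handles df = 1 or df = 2")


# Within-group comparisons of the three ACQ groups (df = 2).
for stat in (1.86, 7.66, 6.74, 4.27, 2.56, 12.62):
    print(stat, round(chi2_sf(stat, 2), 3))

# High- versus low-productivity comparison on the SDS (df = 1).
print(7.15, round(chi2_sf(7.15, 1), 4))
```

The df = 2 values agree with the reported ps of .39, .02, .03, .12, .28, and .002, and the df = 1 value agrees with the reported .0075.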

Social desirability

As in Experiment 1, we regarded lower levels of socially desirable responses as a proxy for higher data quality. We calculated for each participant the percentage of socially desirable responses, according to the SDS (the averages of the SDS percentages are reported in Table 5 for the productivity and ACQ groups). An ANOVA on the SDS mean percent scores with productivity and ACQ conditions showed no statistically significant effect for productivity, ACQ, or their interaction, Fs(1, 2, and 2 [respectively], 531) = 1.3, 1.24, and 1.03, ps = .25, .29, and .36, η²s = .002, .005, and .004, respectively.
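The per-participant score described above can be sketched as follows, assuming each SDS item is answered true/false and has a known socially desirable answer. The function name `sds_percentage` and the example key are hypothetical and do not reproduce the actual SDS items or their scoring key.

```python
def sds_percentage(answers, desirable_key):
    """Percentage of a participant's answers that match the
    socially desirable response for each item."""
    if len(answers) != len(desirable_key):
        raise ValueError("answers and key must have the same length")
    matches = sum(a == k for a, k in zip(answers, desirable_key))
    return 100.0 * matches / len(answers)


# Hypothetical four-item example: two answers match the key
score = sds_percentage([True, False, True, True],
                       [True, True, True, False])
```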

Central-tendency bias

To measure participants’ tendencies to mark the midpoint of the scale, we computed for each participant the relative frequency with which they had marked “3” on the 5-point scales in the TIPI, RSES, and NFC. An ANOVA on this central-tendency bias score showed a significant effect for ACQs, F(2, 531) = 6.04, p = .003, η² = .022, and no significant effects for the level of productivity, F(1, 531) = 3.38, p = .066, η² = .006, or the interaction between the two, F(2, 531) = 1.93, p = .15, η² = .007. Post-hoc comparisons, using Bonferroni’s correction, showed that those who had passed ACQs were less likely to mark the midpoint of the scales than were those who had failed the ACQs (M = 1.81 vs. 0.27, SDs = 0.13 and 0.18, respectively; p = .009, d = 0.31). Respondents who did not receive ACQs showed an average score (M = 0.20, SD = 0.13) that was not significantly different from the other two groups’ scores (p > .05).
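The central-tendency bias score is a simple relative frequency, which might be computed along the following lines. This is an illustrative sketch: the function name `midpoint_rate` and the sample ratings are assumptions, and pooling the items of all 5-point scales into one list per participant is one plausible reading of the procedure.

```python
def midpoint_rate(ratings, midpoint=3):
    """Relative frequency with which a participant chose the
    scale midpoint across all pooled 5-point items."""
    if not ratings:
        raise ValueError("no ratings supplied")
    return sum(1 for r in ratings if r == midpoint) / len(ratings)


# Hypothetical participant: two of five answers are the midpoint
rate = midpoint_rate([3, 1, 5, 3, 2])
```

Because the score is a proportion, it is bounded between 0 and 1 for every participant.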

Discussion

We found corroborating evidence that high-reputation workers (whether having previously completed many or few HITs) can produce high-quality data. In contrast to Experiment 1, which used familiar ACQs (which may have been ineffective for experienced MTurk workers), in Experiment 2 we employed three novel ACQs. Even these novel ACQs did not improve data quality among high-reputation workers, replicating the finding from Experiment 1. Together, the findings suggest that sampling high-reputation workers appears to be a sufficient condition for obtaining high-quality data on MTurk. Note that, as in Experiment 1, this conclusion relies on interpreting a null effect as meaningful, which is possible when samples are adequately powered (Greenwald, 1975). Indeed, our sample had a statistical power of more than 80% to detect differences of at least d = 0.25. The fact that no differences were found suggests that high-reputation workers produce high-quality data, irrespective of ACQs.
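As a rough illustration of what such a power statement implies, the sketch below estimates the per-group sample size needed to detect a standardized difference d. It assumes a two-sided comparison of two equally sized independent groups at α = .05 and uses the normal approximation; these assumptions, and the function name `n_per_group`, are ours for illustration and do not describe the power analysis actually run for the study.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample
    comparison of means (normal approximation)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value
    z_beta = std_normal.inv_cdf(power)           # quantile for target power
    return ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)


needed = n_per_group(0.25)  # effect size mentioned in the text
```

Under these assumptions, detecting d = 0.25 with 80% power takes roughly 250 participants per group; the normal approximation slightly underestimates what an exact t-test calculation would require.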

We also found that workers who were more productive (having completed more than 500 HITs, and sometimes many more than that) were less prone to fail ACQs and, in some respects, produced higher data quality than did less experienced workers, who had completed fewer than 100 HITs. Moreover, ACQs increased data quality to some extent among low-productivity workers, but not among high-productivity workers. This suggests that sampling highly productive high-reputation workers may be the best way to ensure high-quality data without the need for resorting to ACQs. However, one must consider possible drawbacks of including highly productive workers, such as that they might not be totally naïve to the experimental procedure or the questions of the study (see Chandler, Mueller, & Paolacci, 2013, for a discussion of nonnaïveté among MTurk respondents).

General discussion

Data quality is of utmost importance for researchers conducting surveys and experiments using online participant pools such as MTurk. Identifying reliable methods that ensure and increase the quality of data obtained from such resources is thus important and beneficial. In two studies, we found that one way to ensure high-quality data is to restrict the sampling of participants to MTurk workers who have accumulated high ratings from previous researchers (or other MTurk requesters). When sampling such high-reputation workers, data quality (as measured by scales’ reliability, socially desirable responses, central-tendency bias, and the replicability of known effects) was satisfactorily high. In contrast, low-reputation workers seem to pay much less attention to instructions, as indicated by a higher failure rate of ACQs, and thus produced data of lower reliability, exhibited more response biases, and showed smaller effect sizes for well-known effects. Our recommendation is to restrict sampling to high-reputation (and possibly highly productive) MTurk workers only. In our study, we used the arbitrary cutoff of 95% to differentiate between workers with high or low reputation levels. Researchers may of course use a stricter cutoff, given that the distribution of workers is highly skewed in favor of high-reputation workers.
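For researchers who post HITs programmatically, a restriction of this kind could be expressed as worker qualification requirements. The following is a minimal sketch, not a description of how the present studies were set up: it only builds the requirement list (in the shape accepted by the `QualificationRequirements` argument of the MTurk `CreateHIT` operation) and makes no API call. The function name, the choice of comparators, and the default thresholds are assumptions, and the two system qualification type IDs should be checked against Amazon's current documentation before use.

```python
# System qualification type IDs (assumed from Amazon's MTurk
# documentation; verify before use).
PERCENT_APPROVED = "000000000000000000L0"      # approval rate, in percent
NUMBER_HITS_APPROVED = "00000000000000000040"  # count of approved HITs


def reputation_requirements(min_approval_pct=95, more_than_hits=None):
    """Build qualification requirements for high-reputation workers,
    optionally also requiring a minimum number of approved HITs."""
    requirements = [{
        "QualificationTypeId": PERCENT_APPROVED,
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [min_approval_pct],
    }]
    if more_than_hits is not None:
        # Optional productivity filter (e.g., more than 500 HITs)
        requirements.append({
            "QualificationTypeId": NUMBER_HITS_APPROVED,
            "Comparator": "GreaterThan",
            "IntegerValues": [more_than_hits],
        })
    return requirements


# Hypothetical usage: 95% approval and more than 500 approved HITs
reqs = reputation_requirements(95, more_than_hits=500)
```

A stricter reputation cutoff only requires changing `min_approval_pct`; the same filter can also be set through the requester web interface without any code.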

Whereas the first experiment was, by its nature, exploratory, our findings were corroborated in our second experiment, which also helped overcome Experiment 1’s main limitation—workers’ familiarity with the ACQs used. We found that even when novel and unfamiliar ACQs were used, high-reputation workers showed a high likelihood of passing them (indicating that they do read instructions). In fact, one of the most important findings of our research lies in the null effect that ACQs seemed to have on high-reputation workers. Whether or not ACQs were used, these high-reputation workers provided high-quality data across all of the measures that we employed in our study. Whatever effect ACQs had on MTurk workers was limited to low-reputation workers (Exp. 1) or to workers who were less productive (Exp. 2). Even then, the effect was limited only to some of the cases and some of the measures of data quality. Thus, we conclude that sampling high-reputation workers is not only a necessary, but also a sufficient, condition for obtaining high-quality data. Using ACQs does not seem to help researchers obtain higher-quality data, despite previous emphasis on this approach (e.g., Aust et al., 2013; Buhrmester et al., 2011; Downs et al., 2010; Oppenheimer et al., 2009). Perhaps ACQs were essential a few years ago, but they do not seem to be essential currently.

Sampling high-reputation workers to ensure high data quality without using ACQs provides two advantages. First, when ACQs are used and responses are excluded after data collection, experimental cell sizes may differ, and selection bias may occur. Second, ACQs may cause reactance and hamper the natural flow of a study. We did not find evidence for the second advantage; however, it should be noted that we did not include any measures that were specifically geared toward measuring reactance or survey flow (such as attitudes toward the survey or the researchers).

For our recommendation of not using ACQs, but instead restricting sampling to high-reputation workers, to be beneficial, two conditions must hold. First, it is important that sampling only high-reputation workers not result in sampling bias, which would be the case if high-reputation workers differed from low-reputation workers on dimensions other than paying attention to instructions. In our experiments, we did not find evidence for this being the case, since high- and low-reputation workers showed the same distributions of age and gender. It should be noted, however, that we could not assess potential differences in personality traits, self-esteem, and Need for Cognition scores between high- and low-reputation workers, because the lower reliability scores and higher levels of central-tendency bias among low-reputation workers made it impossible to compare these scores to those of high-reputation workers. Second, it is important that restricting sampling to high-reputation workers not interfere with response rates. In our experiments, we also found no evidence for this being the case. In fact, the sample in Experiment 1 obtained from low-reputation workers after 10 days of data collection was about half the size of the sample obtained from high-reputation workers. In Experiment 2, which was conducted a few months later, we were unable to sample a sufficient number of low-reputation workers for our study. Therefore, it seems that restricting samples to high-reputation workers does not significantly reduce the pool from which workers are sampled, and will only minimally affect the time needed to reach a specified sample size. In the current state of the MTurk population, sampling only high-reputation workers appears to be an effective and efficient method to ensure high data quality on MTurk.

Our experiments also point to a possible phenomenon that might be occurring on MTurk: namely, that the number (or ratio) of low-reputation workers is low, and possibly decreasing. In Experiment 1, we found it harder (and more time consuming) to sample low- than to sample high-reputation workers. In Experiment 2, in which we used an even lower cutoff for low-reputation workers, it was not possible to collect a sufficient number of responses from this subpopulation in the study time frame. This suggests that MTurk’s HITs approval system “weeds out” bad workers (i.e., those who perform poorly and do not satisfy requesters’ needs). If this is true, the entire population of MTurk workers will increasingly consist of only highly reputed and productive workers, which would make MTurk an even more attractive pool for researchers. However, another, and less fortunate, process might be at play. It is possible that requesters are approving HITs more than they should, thereby inflating workers’ reputation levels. As a consequence, reputation levels would become less indicative of high-quality workers, and ACQs might be needed again to differentiate “good” from “bad” workers. Although our experiments do not provide conclusive evidence for one interpretation or the other, our findings do suggest that the first, and more fortunate, process is more probable. Because high-reputation workers generated high-quality data and low-reputation workers did not, reputation levels appear to be a reliable indicator of data quality. Further research will be needed to investigate whether reputation still predicts data quality in the future, or on other crowd-sourcing resources for data collection.

Notes

That is, multiplying the p values by the number of possible comparisons, which was, in this case and in all other analyses reported for this study, ten.
We thank David Tannenbaum for this suggestion.

References

Aust, F., Diedenhofen, B., Ullrich, S., & Musch, J. (2013). Seriousness checks are useful to improve data validity in online research. Behavior Research Methods, 45, 527–535. doi:10.3758/s13428-012-0265-2
Buhrmester, M., Kwang, T., & Gosling, S. D. (2011). Amazon’s Mechanical Turk: A new source of inexpensive, yet high-quality, data? Perspectives on Psychological Science, 6, 3–5. doi:10.1177/1745691610393980
Cacioppo, J. T., Petty, R. E., & Feng Kao, C. (1984). The efficient assessment of need for cognition. Journal of Personality Assessment, 48, 306–307.
Chandler, J., Mueller, P., & Paolacci, G. (2013). Nonnaïveté among Amazon Mechanical Turk workers: Consequences and solutions for behavioral researchers. Behavior Research Methods. doi:10.3758/s13428-013-0365-7. Advance online publication.
Downs, J. S., Holbrook, M. B., Sheng, S., & Cranor, L. F. (2010). Are your participants gaming the system? Screening Mechanical Turk workers. In Proceedings of the 28th International Conference on Human Factors in Computing Systems (pp. 2399–2402). New York, NY: ACM.
Fischer, D. G., & Fick, C. (1993). Measuring social desirability: Short forms of the Marlowe–Crowne Social Desirability Scale. Educational and Psychological Measurement, 53, 417–424.
Goodman, J. K., Cryder, C. E., & Cheema, A. (2013). Data collection in a flat world: The strengths and weaknesses of Mechanical Turk samples. Journal of Behavioral Decision Making, 26, 213–224. doi:10.1002/bdm.1753
Gosling, S. D., Rentfrow, P. J., & Swann, W. B., Jr. (2003). A very brief measure of the Big Five personality domains. Journal of Research in Personality, 37, 504–528.
Greenwald, A. G. (1975). Consequences of prejudice against the null hypothesis. Psychological Bulletin, 82, 1–20. doi:10.1037/h0076157
Hakstian, R. A., & Whalen, T. A. (1976). A k-sample significance test for independent alpha coefficients. Psychometrika, 41, 219–231.
Oppenheimer, D. M., Meyvis, T., & Davidenko, N. (2009). Instructional manipulation checks: Detecting satisficing to increase statistical power. Journal of Experimental Social Psychology, 45, 867–872.
Paolacci, G., Chandler, J., & Ipeirotis, P. (2010). Running experiments on Amazon Mechanical Turk. Judgment and Decision Making, 5, 411–419.
Robins, R. W., Hendin, H. M., & Trzesniewski, K. H. (2001). Measuring global self-esteem: Construct validation of a single-item measure and the Rosenberg Self-Esteem Scale. Personality and Social Psychology Bulletin, 27, 151–161.
Rosenberg, M. (1979). Rosenberg self-esteem scale. New York, NY: Basic Books.
Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185, 1124–1131. doi:10.1126/science.185.4157.1124

Author Note

This research was partially supported by a grant from the NSF (No. 1012763), awarded to Alessandro Acquisti.

Author information

Authors and Affiliations

Graduate School of Business Administration, Bar-Ilan University, Ramat-Gan, Israel, 52900
Eyal Peer
School of Economics and Management, Tilburg University, Tilburg, The Netherlands
Joachim Vosgerau
Heinz College, Carnegie Mellon University, Pittsburgh, PA, USA
Alessandro Acquisti

Corresponding author

Correspondence to Eyal Peer.

About this article

Cite this article

Peer, E., Vosgerau, J. & Acquisti, A. Reputation as a sufficient condition for data quality on Amazon Mechanical Turk. Behav Res 46, 1023–1031 (2014). https://doi.org/10.3758/s13428-013-0434-y

Published: 20 December 2013
Issue Date: December 2014
DOI: https://doi.org/10.3758/s13428-013-0434-y
