Experiment 2 employed the same design and measures as Experiment 1, with two exceptions: First, we replaced the ACQs used in Experiment 1 with novel ACQs that we designed ourselves (after soliciting examples from colleagues). Second, in addition to workers’ reputation, we orthogonally manipulated workers’ productivity. Worker productivity refers to the number of HITs that a worker has previously completed on MTurk. As with worker reputation (the percentage of approved HITs), MTurk allows researchers to specify how many HITs workers must have previously completed in order to view and complete a proposed HIT. Workers’ productivity levels seem to vary greatly. For example, about half of the participants in Experiment 1 indicated that they had completed more than 250 HITs, and about 10 % said they had completed more than 5,000 HITs. A worker’s productivity, just like a worker’s reputation, may be a predictor of data quality, such that highly productive workers may be more likely to produce high-quality data than less productive workers. That could be the case because (a) highly productive workers may be more intrinsically motivated to complete HITs to the satisfaction of the requester; (b) highly productive workers may represent “good” workers who have stayed on MTurk, whereas “bad” workers drop out over time; and (c) highly productive workers may be more experienced in answering survey questions, and thus produce higher-quality data.
Experiment 2 served three purposes: first, to test whether the findings of Experiment 1 would replicate; second, to test whether novel and unfamiliar ACQs would improve data quality for high-reputation workers; and third, to test whether worker productivity would have the same effect on data quality as worker reputation.
Method
Sampling
Over 10 days, we sampled MTurk workers (who had not taken part in Exp. 1) from the U.S. with either high or low reputation (above 95 % vs. below 90 % previously approved HITs), and with either high or low productivity (more than 500 vs. fewer than 100 previously completed HITs). Unlike in Experiment 1, in Experiment 2 we manipulated both reputation and productivity such that a gap separated the manipulated levels. In this way, we avoided categorizing MTurk workers with very similar reputation or productivity levels (e.g., 95.1 % vs. 94.9 % approved HITs, or 501 vs. 499 completed HITs) into different groups. As a consequence, it should be easier to detect actual differences in data quality as a function of worker reputation or productivity. The cutoffs for productivity were chosen on the basis of the distribution of self-reported productivity levels in Experiment 1 (about 25 % of participants indicated that they had completed fewer than 100 HITs, and about 30 % said that they had completed more than 500).
Sampling was discontinued when an experimental cell had reached about 250 responses, or after 10 days. Although we were able to collect responses from 537 high-reputation workers in less than two days, we obtained only 19 responses from low-reputation workers over that same period. After two days of very slow data collection, we tried to increase the response rate by reposting the HIT every 24 hours (so that it would remain highly visible to these workers) and by increasing the offered payment (from 70 to 100 cents for a 10-min survey). Unfortunately, both attempts were unsuccessful, and we had received only 30 responses from low-reputation workers after 10 days. We thus decided to focus only on high-reputation workers and on the impact of productivity levels and ACQs on data quality. The obtained sample size allowed for detecting effect sizes of at least d = 0.25 with a power of about 80 %.
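As an illustration of the reported sensitivity (d = 0.25 at roughly 80 % power), the sketch below runs a standard power calculation for a two-group comparison in Python. The two-sided test, the α of .05, and the per-group n of about 268 are our assumptions for the illustration rather than details stated in the text.

```python
# Sketch: sensitivity of the obtained sample, assuming a two-sided
# independent-samples t test with alpha = .05 (these are our assumptions).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Per-group sample size needed to detect d = 0.25 at 80 % power.
n_required = analysis.solve_power(effect_size=0.25, alpha=0.05, power=0.80)
print(f"required n per group: {n_required:.0f}")  # roughly 250 per group

# Power achieved with ~268 workers per productivity group.
achieved = analysis.solve_power(effect_size=0.25, alpha=0.05, nobs1=268)
print(f"achieved power with n = 268 per group: {achieved:.2f}")  # about .82
```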
Participants
We collected responses from a total of 537 MTurk workers with high reputation (95 % or above): 268 with low productivity (100 or fewer previous HITs) and 269 with high productivity (500 or more previous HITs). The high- and low-productivity groups included similar proportions of males (61.5 % vs. 58.6 %), χ²(1) = 0.43, p = .51, but workers in the high-productivity group were somewhat older than those in the low-productivity group (high: M = 34.36, SD = 12.45; low: M = 32.08, SD = 12.62), t(499) = 2.04, p = .04, d = 0.18. As expected, high-productivity workers reported having completed a much larger number of HITs than did low-productivity workers (high: M = 10,954.78, SD = 38,990.67; low: M = 138.64, SD = 151.49), t(499) = 4.38, p < .001, d = 0.39. Interestingly, many workers in the low-productivity group (about 43 %) claimed to have completed more than 100 HITs, a fact that should have prohibited them from taking our HIT. Some of these (about 24 %) claimed to have completed more than 250 HITs, an overreport that can hardly be ascribed to simple oversight or memory error. Lastly, high-productivity workers reported, on average, a slightly higher percentage of previously approved HITs than did low-productivity workers (high: M = 99.40, SD = 0.76; low: M = 99.21, SD = 1.14), t(434) = 2.1, p = .04, d = 0.20.
Design
The participants in each group were randomly assigned to either receive or not receive (novel) ACQs. As in Experiment 1, we oversampled the condition that included ACQs (in a ratio of about 67:33).
Procedure
As in Experiment 1, MTurk workers were invited to complete a survey about personality for 70 cents. Participants first completed the TIPI, followed by the 10-item version of the SDS, the 10-item version of the RSES, and the 18-item version of the NFC scale. On the last page of the survey, participants were asked to indicate their gender and age and to estimate approximately how many HITs they had completed in the past and how many of those were approved (in contrast to Exp. 1, these questions did not include predefined options, but used an open text box in which participants entered their responses, allowing us to collect more granular data).
Participants in the ACQ conditions were asked to answer three additional questions (three novel ACQs). The first presented participants with a picture of an office in which six people were seated and asked them to indicate how many people they saw in the picture. Hidden within a lengthy introduction were instructions not to enter “6,” but instead to enter “7,” to show that they had indeed read the instructions. Any response other than 7 was coded as failing this ACQ. The second novel ACQ was embedded in the middle of the NFC scale, in the form of a statement that read: “I am not reading the questions of this survey.” Any response other than “strongly disagree” was coded as failing this ACQ. The last novel ACQ consisted of two questions that asked participants to state whether they “would prefer to live in a warm city rather than a cold city” and whether they “would prefer to live in a city with many parks, even if the cost of living was higher.” Both questions were answered on 7-point Likert scales with the endpoints strongly disagree (1) and strongly agree (7). Participants were instructed, however, not to answer the questions according to their actual preferences, but to mark “2” on the first question and then add 3 to that value and use the result (i.e., 5) as the answer to the second question. Any deviating responses were coded as failing this ACQ.
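For concreteness, the pass/fail coding described above can be expressed as a simple scoring function. The column names and numeric codings in the sketch below are hypothetical, chosen only to illustrate the logic.

```python
# Sketch of the pass/fail coding for the three novel ACQs.
# Column names and numeric response codings are hypothetical illustrations.
def score_acqs(row: dict) -> dict:
    """Return a pass/fail flag for each of the three novel ACQs."""
    results = {
        # ACQ 1: instructions ask for "7" even though 6 people are pictured.
        "acq1_pass": row["office_count"] == 7,
        # ACQ 2: "I am not reading the questions of this survey" must be
        # answered "strongly disagree" (coded 1 here).
        "acq2_pass": row["nfc_catch_item"] == 1,
        # ACQ 3: first item must be marked 2, second must be 2 + 3 = 5.
        "acq3_pass": row["warm_city"] == 2 and row["city_parks"] == 5,
    }
    results["passed_all"] = all(results.values())
    return results

# Example: a worker who followed all three instructions.
print(score_acqs({"office_count": 7, "nfc_catch_item": 1,
                  "warm_city": 2, "city_parks": 5}))
```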
Results
Attention check questions
As can be seen in Table 4, among those who received the ACQs (about 2/3 of each group), 80.3 % of the high-productivity workers passed all of them, relative to 70.9 % of the low-productivity workers, χ²(3) = 12.63, p = .006. As in Experiment 1, we classified workers of each productivity group according to whether they had passed all ACQs, had failed at least one of the ACQs, or did not receive ACQs at all (see Table 5 for the groups’ sizes).
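The comparison of pass rates above is a standard chi-square test of independence on the counts underlying Table 4. The sketch below uses scipy with placeholder cell counts; the 2 × 4 layout (productivity group by number of ACQs passed, giving df = 3) is our reading of the reported degrees of freedom.

```python
# Sketch: chi-square test of ACQ pass rates by productivity group.
# Cell counts are placeholders; the real counts are reported in Table 4.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: high- vs. low-productivity workers.
# Columns: number of the three ACQs passed (3, 2, 1, 0) -> df = 3.
counts = np.array([
    [144, 20, 10, 5],   # high productivity (placeholder counts)
    [127, 30, 15, 7],   # low productivity (placeholder counts)
])

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.3f}")
```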
Table 4 Rates of passing/failing unfamiliar ACQs in Experiment 2
Table 5 Internal reliability and social desirability scores for the groups in Experiment 2
Reliability
As in Experiment 1, we regarded high internal reliability as evidence of high data quality. However, we could not (as we did in Exp. 1) compare reliabilities between high- and low-reputation workers, because we were unable to sample enough low-reputation workers. As an alternative, we compared the reliability of the measures used in this study (SDS, RSES, and NFC) to their conventional coefficients, as reported in the literature: Fischer and Fick (1993) reported a reliability of .86 for the short form of the SDS with a sample of 309 students; Cacioppo et al. (1984) reported a reliability of .90 for the NFC scale, obtained from 527 students; and Robins, Hendin, and Trzesniewski (2001) reported a reliability of .88 for the RSES among 508 students. We compared the reliabilities obtained from our MTurk groups to these scores using the Hakstian and Whalen (1976) test for the significance of differences between independent reliability coefficients. In all analyses, we employed the Bonferroni correction and multiplied the p values by the number of possible comparisons. We found that all groups showed significantly lower reliability for the SDS than was reported in the literature, χ²(1) > 15.6, p < .01. However, the reliabilities for the RSES and NFC scales were not significantly lower than those reported in the literature (ps > .05). In fact, in some cases the reliabilities were higher than those reported in the literature, especially among high-productivity workers and those who had passed the ACQs (see Table 5).
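For reference, the internal reliabilities discussed here are Cronbach’s alpha coefficients. Below is a minimal sketch of the alpha computation and of the Bonferroni adjustment described above (multiplying p values by the number of comparisons); the item matrix is simulated, and the Hakstian and Whalen (1976) test itself is not reproduced here.

```python
# Sketch: Cronbach's alpha for a respondents-by-items matrix, plus the
# Bonferroni adjustment described in the text. The data are simulated.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array with rows = respondents and columns = scale items."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

def bonferroni(p_values):
    """Multiply each p value by the number of comparisons, capping at 1."""
    m = len(p_values)
    return [min(p * m, 1.0) for p in p_values]

rng = np.random.default_rng(0)
# 250 simulated respondents answering the 10 RSES items on a 1-5 scale;
# responses are random here, so alpha will be near zero (real items correlate).
simulated_rses = rng.integers(1, 6, size=(250, 10)).astype(float)
print(f"alpha = {cronbach_alpha(simulated_rses):.3f}")
print(bonferroni([0.004, 0.020, 0.300]))  # e.g., three planned comparisons
```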
Comparing high- and low-productivity workers, we found that high-productivity workers produced higher reliability scores for the SDS, RSES, and NFC scales (.70, .931, and .951 vs. .576, .910, and .912, respectively). These differences were statistically significant for all three scales, χ²s(1) = 7.15, 4.23, and 21.27, ps = .0075, .039, and < .001, respectively, suggesting that high-productivity workers produced higher-quality data. When comparing the three groups who had passed, failed, or not received ACQs, we found no statistically significant differences in the reliability scores on the SDS, but we did find statistically significant differences on the RSES and NFC scales, χ²s(2) = 3.38, 18.84, and 7.61, p = .18, p < .001, and p = .022, respectively. On the two scales that showed statistical differences (RSES and NFC), participants who had passed the ACQs showed higher reliability scores than did those who had failed or had not received ACQs (.938 vs. .897 and .888 for the RSES, and .946 vs. .917 and .927 for the NFC scale). However, the scores of those who had failed versus those who did not receive ACQs were not statistically different for either the RSES or the NFC scale, χ²s(1) = .18 and .42, ps = .67 and .51, respectively.
We then examined whether the effect of adding (novel) ACQs occurred within both high- and low-productivity workers. We compared the reliability scores of the three ACQ groups within each productivity group (given in Table 5). Among the low-productivity groups, we found no statistical difference for the SDS, but we did find significant differences for the RSES and the NFC scale, χ²s(2) = 1.86, 7.66, and 6.74; ps = .39, .02, and .03, respectively. Among the high-productivity groups, we did not find statistical differences for the SDS or the NFC, but we did find significant differences for the RSES, χ²s(2) = 4.27, 2.56, and 12.62; ps = .12, .28, and .002, respectively. This suggests that the aforementioned overall effect of ACQs was mostly driven by differences among the low-productivity workers.
Social desirability
As in Experiment 1, we regarded lower levels of socially desirable responding as a proxy for higher data quality. For each participant, we calculated the percentage of socially desirable responses according to the SDS (the average SDS percentages for the productivity and ACQ groups are reported in Table 5). An ANOVA on the SDS mean percentage scores with productivity and ACQ condition as factors showed no statistically significant effect of productivity, ACQ, or their interaction, F(1, 531) = 1.3, F(2, 531) = 1.24, and F(2, 531) = 1.03, ps = .25, .29, and .36, η²s = .002, .005, and .004, respectively.
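The analysis above is a 2 (productivity) × 3 (ACQ group) between-subjects ANOVA. A minimal statsmodels sketch follows; the column names and the simulated data frame are hypothetical and serve only to make the example runnable.

```python
# Sketch: 2 (productivity) x 3 (ACQ group) between-subjects ANOVA on the
# percentage of socially desirable responses. Column names are hypothetical,
# and the data frame is simulated only to make the example runnable.
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

def sds_anova(df: pd.DataFrame) -> pd.DataFrame:
    """Two-way ANOVA (Type II sums of squares) on the SDS percentage."""
    model = ols("sds_pct ~ C(productivity) * C(acq_group)", data=df).fit()
    return anova_lm(model, typ=2)

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "sds_pct": rng.uniform(0, 100, size=300),
    "productivity": rng.choice(["high", "low"], size=300),
    "acq_group": rng.choice(["passed", "failed", "no_acq"], size=300),
})
print(sds_anova(df))
```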
Central-tendency bias
To measure participants’ tendencies to mark the midpoint of the scale, we computed for each participant the relative frequency with which they had marked “3” on the 5-point scales of the TIPI, RSES, and NFC. An ANOVA on this central-tendency bias score showed a significant effect of ACQs, F(2, 531) = 6.04, p = .003, η² = .022, and no significant effects of productivity level, F(1, 531) = 3.38, p = .066, η² = .006, or of the interaction between the two, F(2, 531) = 1.93, p = .15, η² = .007. Post-hoc comparisons using Bonferroni’s correction showed that those who had passed the ACQs were less likely to mark the midpoint of the scales than were those who had failed the ACQs (Ms = 0.18 vs. 0.27, SDs = 0.13 and 0.18, respectively; p = .009, d = 0.31). Respondents who did not receive ACQs showed an average score (M = 0.20, SD = 0.13) that was not significantly different from the other two groups’ scores (ps > .05).
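The central-tendency score above is simply, for each participant, the proportion of items answered at the scale midpoint. A small sketch follows; the response matrix is simulated and the pooled item count (10 TIPI + 10 RSES + 18 NFC = 38) is our reading of the measures listed above.

```python
# Sketch: per-participant central-tendency bias, i.e., the relative frequency
# of midpoint ("3" on a 5-point scale) responses across the pooled items.
import numpy as np

def midpoint_rate(responses: np.ndarray, midpoint: int = 3) -> np.ndarray:
    """responses: rows = participants, columns = pooled TIPI/RSES/NFC items."""
    return (responses == midpoint).mean(axis=1)

rng = np.random.default_rng(2)
simulated = rng.integers(1, 6, size=(5, 38))  # 5 participants, 38 pooled items
print(midpoint_rate(simulated))  # proportions, e.g., values around 0.2
```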
Discussion
We found corroborating evidence that high-reputation workers (whether they had previously completed many or few HITs) can produce high-quality data. In contrast to Experiment 1, which used familiar ACQs (which may have been ineffective for experienced MTurk workers), Experiment 2 employed three novel ACQs. Even these novel ACQs did not improve data quality among high-reputation workers, replicating the finding of Experiment 1. Together, the findings suggest that sampling high-reputation workers is a sufficient condition for obtaining high-quality data on MTurk. Note that, as in Experiment 1, this conclusion relies on interpreting a null effect as meaningful, which is possible when samples are adequately powered (Greenwald, 1975). Indeed, our sample had statistical power of more than 80 % to detect differences of at least d = 0.25. The fact that no differences were found suggests that high-reputation workers produce high-quality data, irrespective of ACQs.
We also found that more productive workers (who had completed more than 500 HITs, and sometimes many more) were less prone to fail ACQs and, in some respects, produced higher-quality data than did less productive workers, who had completed fewer than 100 HITs. Moreover, ACQs increased data quality to some extent among low-productivity workers, but not among high-productivity workers. This suggests that sampling highly productive, high-reputation workers may be the best way to ensure high-quality data without resorting to ACQs. However, one must consider possible drawbacks of sampling highly productive workers, such as that they might not be totally naïve to the experimental procedure or the questions of the study (see Chandler, Mueller, & Paolacci, 2013, for a discussion of nonnaivety among MTurk respondents).