Preregistration
Key aspects of this study, including measures, hypotheses, and study design, were preregistered using the Open Science Framework (OSF) Registration platform (Agley et al., 2020).
Participants
Sample size
We recruited 1100 participants (with replacement in some arms, see Design). Our a priori power analysis indicated that, using a fixed-effects ANOVA to detect an overall difference in means among the four study arms, this sample would allow detection of an effect of size f = 0.10 (critical F = 2.61) with power of 0.80 at a two-tailed alpha of 0.05. With equal allocation, each arm was planned to include 275 subjects.
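For reference, the planned sample size can be reproduced with standard power-analysis software. The following Python sketch is illustrative only (it is not the software used for the original calculation) and applies the stated parameters:

# A priori power analysis for a one-way, four-arm ANOVA with effect size
# f = 0.10, alpha = .05, and power = .80. Illustrative sketch; the original
# calculation may have been performed with other software.
from statsmodels.stats.power import FTestAnovaPower

analysis = FTestAnovaPower()
n_total = analysis.solve_power(effect_size=0.10, alpha=0.05, power=0.80, k_groups=4)
print(f"Required total N: {n_total:.0f}")          # approximately 1096, consistent with recruiting 1100
print(f"Per-arm n (equal allocation): {n_total / 4:.0f}")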
Recruitment
Subjects were recruited on November 20–21, 2020, using Amazon’s MTurk platform. We used MTurk specifications similar to those that we had successfully used in our own prior research (Agley & Xiao, 2020). Eligible workers were required to claim U.S. residency, have a successful task completion rate above 95%, and have completed between 100 and 10,000 tasks. In addition, workers must be aged 18 or older to join MTurk, which established a default minimum age for the study.
Compensation
Participants were paid $1.10 USD upon successful completion of the study but were not paid if they failed quality-control checks. The informed consent statement warned participants: “This survey may include checks to screen out bots and individuals who are not eligible to take the survey. If you are screened out in this manner, the survey will end, and you should return the HIT in order to avoid your work being rejected.” In addition, all of the checks (see Table 1) were placed in the first section of the study to avoid collecting a substantial amount of data from participants without compensating them.
Table 1 Quality control (intervention) information
Ethics approval and consent to participate
This study was reviewed by the Indiana University Institutional Review Board (IRB) prior to being conducted (#2011696997). All participants digitally indicated consent but were not informed that they would be randomized to different arms, nor that the purpose of the study was to assess the effects of data quality-control techniques. The only statement describing the study content within the study information sheet (SIS) was, “This study will ask a series of questions about your mood, whether you have felt anxious recently, and your alcohol consumption.” A waiver for this incomplete informed consent was approved as part of the IRB review.
Measures
Three screening tools were completed in each of the four arms: the USAUDIT (ten questions), the PHQ-9 (nine questions), and the GAD-7 (seven questions) (Higgins-Biddle & Babor, 2018; Kroenke et al., 2001; Spitzer et al., 2006). In those and numerous other studies, each screening tool has been validated and found reliable for self-administration in adult populations. Scoring instructions for each instrument are described in the cited studies; following those established rules, we computed a summed continuous score for each instrument, with allowable ranges of 0 to 46 for the USAUDIT, 0 to 27 for the PHQ-9, and 0 to 21 for the GAD-7.
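As an illustration, scoring reduces to summing item responses and verifying that each total falls within its allowable range. The Python sketch below shows that logic; the item column names are hypothetical placeholders, not the variable names used in the study dataset.

import pandas as pd

# Allowable score ranges for each instrument, per the cited scoring rules.
RANGES = {"usaudit": (0, 46), "phq9": (0, 27), "gad7": (0, 21)}

def score_instrument(df: pd.DataFrame, prefix: str) -> pd.Series:
    """Sum the items for one instrument (columns named, e.g., 'phq9_1' ... 'phq9_9')
    and verify that every total falls within the allowable range."""
    items = [c for c in df.columns if c.startswith(prefix + "_")]
    total = df[items].sum(axis=1)
    low, high = RANGES[prefix]
    assert total.between(low, high).all(), f"{prefix} score out of range"
    return total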
Participants’ self-reported sociodemographic characteristics were collected for gender, ethnicity, race, age, and highest completed level of education. Question wording for each metric is available on OSF (Agley et al., 2020).
Procedures
This was a single-stage, randomized controlled trial with equal allocation to each study arm (1:1:1:1). The size of each arm was fixed at the point of sampling but varied slightly for analysis (see Data cleaning). The intervention was a set of data quality-control (exclusion) procedures that varied by study arm (see Table 1 for details and rationale).
- Arm 1 was a control arm that contained no additional quality-control procedures beyond the standard eligibility requirements (see Participants).
- Arm 2 was a bot/VPN check that asked participants to select the telephone number they would call in the event of an emergency and to correctly identify a drawn image of an eggplant.
- Arm 3 was a truthfulness/attention check that asked participants whether they had done business with Latveria (a country that exists only in the Marvel Comics universe) and then required them to pass two attention-check questions.
- Arm 4 was a stringent arm that combined the checks from Arms 2 and 3.
Allocation and storage
The allocation sequence was managed using the Randomizer tool in Qualtrics (Qualtrics, 2020). Allocation concealment was ensured because the procedure was automated and occurred after consent was processed. All data were collected and stored using the Qualtrics XM platform, which enables direct export to multiple formats (e.g., CSV, Excel, SPSS).
Data cleaning
Core concepts
With MTurk, workers are not paid directly by researchers, but are instead provided with a unique random ID, which they enter into Amazon’s platform for verification. Thus, researchers must resolve discrepancies between the local list of IDs and the list submitted by workers for payment. In some cases, fraudulent ID submission may require a small number of additional surveys to be fielded, as was the case here.
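Conceptually, this reconciliation is a comparison of two ID lists. A minimal Python sketch of that check follows; the file names and column labels are hypothetical, not those used in this study.

import pandas as pd

# Compare IDs issued by the survey platform against IDs submitted to MTurk for payment.
survey_ids = set(pd.read_csv("qualtrics_export.csv")["completion_id"])
claimed_ids = set(pd.read_csv("mturk_claims.csv")["submitted_id"])

fraudulent = claimed_ids - survey_ids   # payment claimed with an ID the study never issued
unclaimed = survey_ids - claimed_ids    # survey submitted but payment never claimed
print(f"{len(fraudulent)} fraudulent claims; {len(unclaimed)} unclaimed surveys")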
We paid for 1100 workers to complete our survey. In theory, the CONSORT flow diagram for this trial would look like the one shown in Fig. 1. In practice, however, the separation between MTurk and Qualtrics (the survey platform) meant that an intermediary data-cleaning step occurred while the study was still “in the field.” Specifically, several things could independently be true or false for each case:
- Workers could (a) file for payment (submit the random ID generated by Qualtrics to MTurk for review) or (b) not.
- Workers could (a) submit their survey to Qualtrics, or (b) close the survey or Internet browser window before submitting it. Importantly, survey submission occurred at termination of the study, which happened either when the questionnaire was fully completed or when the worker failed a quality-control section. This meant that workers could submit, or fail to submit, their responses in Qualtrics regardless of whether they were screened out by quality-control measures or successfully finished the study. Thus, a “submitted survey” was different from an “unsubmitted survey,” and both were different from a “usable survey,” in which a participant successfully reached the questionnaire and submitted it.
- Workers could (a) submit a real ID provided by the study or (b) submit a fake random ID, either by guessing based on IDs used by other studies or by learning the ID pattern from an MTurk forum or website, though this occurs infrequently (Chandler et al., 2014).
Midstream assessment of the data
When 1100 workers had filed for payment, we had 1091 submissions eligible for payment, 1110 usable surveys, 1391 submitted surveys (including one refusal), and 181 unsubmitted surveys. We prepared a diagram to illustrate the computation of these numbers (see Fig. 2). At this point, we also cross-checked frequencies to validate the extant dataset (see the first portion of the analytic syntax in Attachment 1 and timestamped partial data in Attachment 2).
Ideally, randomization would occur after quality-control checks, but because the intervention in this study was the quality-control check itself, randomization had to occur beforehand. Further, the rapid pace of response submission on MTurk meant that multiple people could be sorted into an arm when its quota was almost full, resulting in a slight overage for that arm. This issue was compounded by payment-claim discrepancies. Thus, as shown in Fig. 2, there was some variability in arm sizes. We opted not to alter the quotas or otherwise externally influence the random assignment. In making this decision, we considered that our primary hypothesis would be tested with ANOVA, which has been suggested to be fairly robust even to moderate deviations from its statistical assumptions (Blanca et al., 2017).
Finalizing data collection
To reach our planned recruitment of 1100 paid subjects, we re-opened the survey for a brief period for nine more participants, with random assignment to Arms 3 and 4 (since Arms 1 and 2 were full). There were no anomalies at the payment-claim review stage, meaning we obtained nine more usable surveys. Technically, those nine subjects had a different allocation chance (0:0:1:1), but sensitivity analyses (see supplemental files) that excluded those subjects did not produce different study outcomes, so the data were retained. We also had 15 additional submitted surveys (from workers who failed the quality checks) and seven more unsubmitted surveys, bringing the total number of submitted surveys to 1415 (+24) and the number of unsubmitted surveys to 188 (+7).
Incorporating unsubmitted surveys
Unsubmitted surveys were merged into the dataset for the arm to which they had been assigned and were classified using a binary decision heuristic. First, unsubmitted surveys for which the last answer provided before exiting was for a quality-control question were considered to have been rejected from that arm. Second, unsubmitted surveys for which the last answer provided before exiting was not for a quality-control question were considered to represent a participant who dropped out of the study (e.g., partial completion or non-response) for a reason unrelated to failing a quality-control question. Thus, this dataset included 29 non-respondents (of whom 19 dropped out before the questionnaire and ten dropped out during the USAUDIT; none dropped out during the PHQ-9 or GAD-7). We also assigned 20 additional rejections to Arm 2, 65 additional rejections to Arm 3, and 74 additional rejections to Arm 4. The final distribution of data is shown in Fig. 3, and the final analytic sample comprised 1119 usable surveys.
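A minimal sketch of this heuristic is shown below, assuming a hypothetical set of quality-control item names and a field recording the last item answered; neither reflects the actual variable names in the study dataset.

# Classify an unsubmitted survey by the last question answered before exit.
# The item names and the function argument are hypothetical placeholders.
QUALITY_CONTROL_ITEMS = {"emergency_number", "eggplant_image",
                         "latveria", "attention_check_1", "attention_check_2"}

def classify_unsubmitted(last_item_answered: str) -> str:
    """Return 'rejected' if the worker exited at a quality-control question,
    otherwise 'dropout' (exit unrelated to failing a quality-control question)."""
    if last_item_answered in QUALITY_CONTROL_ITEMS:
        return "rejected"
    return "dropout"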
Analyses
Data were analyzed, separately by screening tool, using analysis of variance (ANOVA), with study arm as the independent variable and the outcome score as the dependent variable. Post hoc bivariate comparisons between study arms used Tukey’s HSD. In addition, reviewers requested exploratory analyses that were not included in the preregistered protocol. First, bivariate correlations between each pair of screening tools were computed separately by study arm using Pearson correlation coefficients to verify that established correlations among these tools remained present in this study. Second, differences in sociodemographic variables were assessed across study arms using either Fisher’s exact tests with Monte Carlo estimation or ANOVA, depending on variable type.
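The primary analysis corresponds to a standard one-way ANOVA followed by Tukey’s HSD. The sketch below illustrates that workflow in Python using statsmodels; the analyses reported here were conducted in separate statistical software, and the column names ('arm', 'score') are hypothetical placeholders.

import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def analyze_screening_tool(df: pd.DataFrame) -> None:
    # Overall F test for differences in mean score across study arms.
    model = ols("score ~ C(arm)", data=df).fit()
    print(anova_lm(model, typ=2))
    # Post hoc pairwise comparisons between arms.
    print(pairwise_tukeyhsd(df["score"], df["arm"]))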
Graphing software was used to generate distribution plots of the scores for each group to inspect differences in dispersion metrics (standard deviation, skewness, and kurtosis). Levene’s test of equality of variances based on medians (Nordstokke & Zumbo, 2007) was used to determine whether there was evidence of significant heterogeneity of variance between arms. To reduce missingness, respondents were required to answer each item on each page of the survey before proceeding. Thus, although we had planned to handle missing data using multiple imputation, the study structure resulted in almost no missing data.
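The median-based Levene test (the Brown–Forsythe variant) and the dispersion metrics are available in scipy. A brief sketch follows, again with hypothetical column names.

from scipy import stats

def dispersion_checks(df) -> None:
    # Median-centered Levene test of homogeneity of variance across arms.
    groups = [g["score"].to_numpy() for _, g in df.groupby("arm")]
    statistic, p = stats.levene(*groups, center="median")
    print(f"Levene (median-centered): F = {statistic:.3f}, p = {p:.3f}")
    # Per-arm skewness and kurtosis.
    for arm, g in df.groupby("arm"):
        print(arm, "skew:", stats.skew(g["score"]), "kurtosis:", stats.kurtosis(g["score"]))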
Results
A total of 1603 workers registered for the survey task on MTurk. Of those, one refused consent and 29 dropped out of the study. An additional 55 workers were rejected from Arm 2, 189 were rejected from Arm 3, and 210 were rejected from Arm 4, yielding the analytic sample of 1119 (which included 19 usable surveys where the worker did not submit to MTurk for payment; see Fig. 3). Sociodemographic characteristics of that sample are provided in Table 2. Samples in each arm tended to be slightly more male than female (54.4–57.8% male), except in Arm 3 (51.1% female). Participants were between 11.1% (Arm 3) and 20.1% (Arm 2) Hispanic/Latino and predominantly White (between 71.1% in Arm 4 and 79.2% in Arm 2), and each arm had a generally normal distribution of educational attainment centered on a bachelor’s degree. The mean age of respondents varied narrowly across study arms, from 38.4 years to 39.9 years. These characteristics were relatively uniform across study arms for race, age, and education level. However, some significant differences were observed for ethnicity (p = .008) and gender (p = .025). Self-identified Hispanics were somewhat underrepresented in Arms 3 and 4. Further, small differences were observed among arms in self-reported transgender and other gender identities, and self-identified females appeared more prevalent in Arm 3. Screening scores and analytic results are described subsequently and are provided in Tables 3 and 4, respectively.
Table 2 Sociodemographic characteristics by study arm
Table 3 Screening scores by study arm
Table 4 ANOVA and Tukey HSD post hoc test scores
USAUDIT
The USAUDIT displayed good-to-excellent scale reliability, ranging from α = .884 in Arm 4 to α = .910 in Arm 1. Scores generally decreased from Arm 1 to Arms 3 and 4. Participants in Arm 1 reported a mean score of 13.64 (SD = 10.20), those in Arm 2 reported a mean score of 12.08 (SD = 9.88), those in Arm 3 reported a mean score of 9.14 (SD = 8.39), and participants in Arm 4 reported a mean of 9.33 (SD = 8.10). There was a significant overall difference in USAUDIT scores between the four study arms (p < .001). Post hoc testing using Tukey’s HSD identified that Arm 1 had a higher USAUDIT mean than Arm 3 (+ 4.51, 95% CI + 2.49 to + 6.52) and Arm 4 (+ 4.31, 95% CI + 2.31 to + 6.31), and that Arm 2 had a higher USAUDIT mean than Arm 3 (+ 2.94, 95% CI + 0.93 to + 4.94), and Arm 4 (+ 2.74, 95% CI + 0.75 to + 4.74).
GAD-7
The GAD-7 displayed excellent scale reliability, ranging from α = .926 in Arm 3 to α = .933 in Arms 1 and 4. Scores decreased with each subsequent arm. Participants in Arm 1 reported a mean score of 8.56 (SD = 6.17), those in Arm 2 reported a mean score of 7.33 (SD = 5.87), those in Arm 3 reported a mean score of 6.67 (SD = 5.74), and participants in Arm 4 reported a mean of 6.24 (SD = 5.63). There was a significant overall difference in GAD-7 scores between the four study arms (p < .001). Post hoc tests found that the mean GAD-7 score in Arm 1 was higher than the score in Arm 3 (+ 1.89, 95% CI + 0.61 to + 3.17) and Arm 4 (+ 2.32, 95% CI +1.05 to + 3.59).
PHQ-9
The PHQ-9 also displayed excellent scale reliability, ranging from α = .919 in Arm 3 to α = .938 in Arm 1. As with the GAD-7, scores decreased with each subsequent arm. Respondents in Arm 1 reported a mean score of 10.24 (SD = 7.77), those in Arm 2 reported a mean of 8.54 (SD = 7.18), respondents in Arm 3 reported a mean score of 7.19 (SD = 6.60), and those in Arm 4 reported a mean of 7.10 (SD = 6.78). There was a significant overall difference in PHQ-9 scores between the four study arms (p < .001). Post hoc testing identified a significantly higher mean PHQ-9 score for Arm 1 than Arm 2 (+ 1.69, 95% CI + 0.16 to + 3.22), Arm 3 (+ 3.04, 95% CI + 1.49 to + 4.60), and Arm 4 (+ 3.14, 95% CI + 1.59 to + 4.68).
Differences in dispersion
Levene’s tests based on the median (Nordstokke & Zumbo, 2007) clearly indicated heterogeneous dispersion across study arms for the USAUDIT (F = 10.685, p < .001) and PHQ-9 (F = 8.525, p < .001), and suggested heterogeneous dispersion for the GAD-7 (F = 2.681, p = .046). Although we originally proposed comparisons of standard deviation between arms, our prespecified visual inspection identified SD as a less useful metric than skewness and kurtosis for understanding these data (nonetheless, SD data are available through the supplemental files). Most notably, positive skewness was more evident in Arms 3 and 4 than in Arm 1 for each screening tool, with Arm 2 falling approximately midway between Arm 1 and Arm 3 in skewness. Visuals of skewness and kurtosis are available in Figs. 4 and 5.
Correlations between screening tools
Table 5 contains bivariate correlation coefficients for each pair of screening tools, separated by study arm. All bivariate correlations were statistically significant in each study arm (p < .001), as expected. However, the correlation coefficients varied somewhat by arm, ranging from r = 0.536 (Arm 3) to r = 0.580 (Arm 2) for the USAUDIT/PHQ-9 comparisons, from r = 0.403 (Arm 3) to r = 0.524 (Arm 1) for the USAUDIT/GAD-7 comparisons, and from r = 0.855 (Arm 3) to r = 0.918 (Arm 1) for the PHQ-9/GAD-7 comparisons.
Table 5 Correlations between screening scores by study arm
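For completeness, the per-arm correlations in Table 5 correspond to pairwise Pearson coefficients computed within each arm; a short illustrative sketch (with hypothetical column names) is shown below.

from itertools import combinations
from scipy import stats

def correlations_by_arm(df) -> None:
    # Pearson correlations between each pair of screening scores, within each arm.
    for arm, g in df.groupby("arm"):
        for a, b in combinations(["usaudit", "phq9", "gad7"], 2):
            r, p = stats.pearsonr(g[a], g[b])
            print(f"Arm {arm}: {a} vs. {b}: r = {r:.3f}, p = {p:.3g}")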