Abstract
Carelessness or insufficient effort responding is a widespread problem in online research, with estimates ranging from 3% to almost 50% of participants in online surveys being inattentive. While detecting carelessness has been subject to multiple studies, the factors that reduce or prevent carelessness are not as well understood. Initial evidence suggests that warning statements prior to study participation may reduce carelessness, but there is a lack of conclusive high-powered studies. This preregistered randomized controlled experiment aimed to test the effectiveness of a warning statement and an improved implementation of a warning statement in reducing participant inattention. A study with 812 participants recruited on Amazon Mechanical Turk was conducted. Results suggest that presenting a warning statement is not effective in reducing carelessness. However, requiring participants to actively type the warning statement statistically significantly reduced carelessness as measured with self-reported diligence, even-odd consistency, psychometric synonyms and antonyms, and individual response variability. The active warning statements also led to statistically significantly more attrition and potentially deterred those who were likely to be careless from even participating in this study. We show that the current standard practice of implementing warning statements is ineffective and novel methods to prevent and deter carelessness are needed.
Carelessness or insufficient effort responding (IER) is a frequent problem in online research, with prevalence estimates ranging from 3–9% (Maniaci & Rogge, 2014), through 10–12% (e.g., Meade & Craig, 2012) and 18–27% (Peer et al., 2017), up to 45.9% (Brühlmann et al., 2020) of participants in surveys exhibiting such behavior. IER can have serious consequences such as failed manipulations (Oppenheimer et al., 2009), false-positive findings (Huang et al., 2015), and unfavorable psychometric properties of scales (Johnson, 2005; Maniaci & Rogge, 2014). Thus, ensuring data quality is vital for both online and offline research. The prevalence of IER in online surveys seems to depend on sample characteristics and the detection methods employed (Curran, 2016; Toich et al., 2021). While much work has focused on identifying and excluding participants exhibiting IER, there are still many open questions on how to effectively reduce IER in online surveys (see, e.g., Arthur et al., 2021). There are many influencing factors over which researchers have little control, such as environmental distractions (Carrier et al., 2009), participant–researcher distance (Meade & Craig, 2012), or participant multitasking (Zwarun & Hall, 2014). However, one frequently used technique aiming to reduce carelessness is to make participants aware of the consequences of inattentive responding. For instance, researchers may tell participants that they will withhold incentives if participants respond inattentively or, alternatively, emphasize the importance of valid responses and the meaning of the study.
In Berinsky et al. (2016), a warning statement improved passage rates by eight percentage points compared to respondents who went through the survey without a warning. Huang et al. (2012) found that warning statements reduce the severity of IER. Similarly, Breitsohl and Steidelmüller (2018) demonstrated that informing participants that methods to assess carefulness are used improved scale reliability in a survey. They used the same wording as Huang et al. (2015), who also found positive effects on participants' self-reports when using a warning statement. Bowling et al. (2020) found that participants in a warning condition, compared to a control group, began engaging in careless responding considerably later in the survey. In contrast to the aforementioned studies, Meade and Craig (2012) found that warning statements provided no benefit over using identified responses without the warning and worsened respondents' self-reported attitudes toward the study. More recently, Toich et al. (2021) found no statistically significant differences between negative and positive warning statements as a means to reduce IER in Amazon's Mechanical Turk (MTurk) and University Participant Pool samples.
Taken together, there seems to be slightly more evidence in favor of the effectiveness of warning statements in deterring insufficient effort responding. In their review, Ward and Meade (2023) concluded that more positive approaches, such as offering rewards for careful responding, have been less effective than warnings. However, the implementation of these warning statements differed substantially across studies, and it is unclear how warnings affect various carelessness detection methods. In Berinsky et al. (2016), participants were informed that each of their responses would be checked and that only responses showing that participants had read and understood the survey would be accepted. Huang et al. (2012) mentioned that sophisticated statistical control methods would be used to check the validity of responses and that participants could lose their credits if they were not attentive. This wording was also used in a slightly adapted form in Bowling et al. (2020). Similarly, but without naming any consequences, Huang et al. (2015) and Breitsohl and Steidelmüller (2018) instructed participants that several methods to assess carefulness and ensure data quality were used. In their stern warning condition, Meade and Craig (2012) emphasized the importance of the responses and that honest and thoughtful responding was subject to the academic integrity policy.
Thus, it seems that warning statements have the potential to reduce IER, but the concrete implementation of these statements, especially concerning the consequences of inattentive responding, varies greatly between studies. To date, there is no comprehensive study that examines the effects of warning statements on various carelessness detection methods with preregistered hypotheses. Therefore, this study aims to test the general hypothesis that warning statements reduce carelessness in a registered randomized controlled experiment with two variants of warning statement implementations.
Purpose of the present study
In the present study, we examined the effectiveness of warning statements on nine IER measures for participants in an online survey on MTurk. The preregistered experiment was designed in line with the works of Huang et al. (2012), Huang et al. (2015), and Toich et al. (2021), using the 300 items of the International Personality Item Pool-NEO inventory (IPIP-NEO; Goldberg et al., 2006; Goldberg, 1999). The effectiveness of a warning statement with the potential for negative consequences (loss of reimbursement for participation) was tested in a passive and an active condition. In the passive condition, the warning statement was presented to participants before the main part of the questionnaire, while in the active condition, participants were additionally required to type the statement into a text box to ensure attentive reading. Most studies that examined the effects of warning statements merely presented the statement and did not require participants to demonstrate that they had read the warning. We expected that writing the statement down, as a way to increase engagement with it, would make the warning more likely to be remembered while completing the survey and that this condition would maximize the effectiveness of the warning statement. Thus, we expected that participants who were asked to read the warning statement, and those who were additionally required to write it down, would show less carelessness throughout the survey than participants in the control condition.
To detect IER, we selected nine of the methods recommended in Curran (2016) and Huang and Wang (2021): attention check items (instructed response items, bogus/infrequency items), self-report items, and response time as procedural intervention indices, and longstring, Mahalanobis distance, even-odd consistency, psychometric synonyms and antonyms, and individual response variability as post hoc statistical indices. In line with recommendations by Meade and Craig (2012) and Ward and Meade (2023), we used several different carelessness indices to capture multiple forms of inattentive responding. However, analyzing the effects of the three conditions on nine different carelessness measures is unlikely to yield a unified picture in which the conditions differ on either all or none of the measures. Therefore, we aim to answer the following research question:
Research question: Do warning statements reduce participant inattention, as measured by the rate of failed attention check items (instructed response items, bogus items/infrequency items) and the values on self-report items, response time, longstring, Mahalanobis distance, even-odd consistency, psychometric synonym and antonym, and individual response variability?
We propose hypotheses H1a–H9b (see Table 1).
For the two warning statement conditions, we expect, in general, participants in the active condition to show less inattentive behavior than those in the passive condition. We do not expect this effect to be large enough to reach statistical significance.
Preregistration
The confirmatory hypotheses H1a to H9b were registered on OSF (https://osf.io/jeurs/) prior to data collection.
Reporting
We report how we determined our sample size, all data exclusions, all manipulations, and all measures in the study.
Method
Power analysis
We conducted an a priori power analysis using the effect sizes reported in Table 1 of Huang et al. (2012). These effect sizes ranged from d = 0.13 for response time up to d = 0.43 for individual reliability. Using the smallest observed effect size of d = 0.13 (Cohen's f = 0.065), α = 0.05, and a power of 1 − β = .95, the analysis returned a required sample size of 1219 per group for a one-way ANOVA. However, the resulting total sample size of 3657 was beyond the budget of this research. The funds were limited to a sample of 810 participants in total (about CHF 2000 including fees). A sensitivity analysis revealed that with N = 270 per group and a power of 1 − β = .95, a small to medium effect of f = 0.1383 (d = 0.277) could be detected. With a lower power of 1 − β = .80 and N = 270, a small effect of f = 0.1093 (d = 0.216) can be detected. All power analyses were conducted using G*Power 3.1.9.6 (Faul et al., 2009).
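The reported calculations can be cross-checked in R. The sketch below uses the pwr package and converts d to Cohen's f as f = d/2, as in the manuscript; this is only an illustration, since the authors report using G*Power 3.1.9.6.

```r
# A minimal sketch reproducing the reported power analyses with the pwr package
# (an illustrative cross-check, not the authors' original G*Power procedure).
library(pwr)

# A priori analysis: smallest effect from Huang et al. (2012), d = 0.13,
# converted to Cohen's f = d / 2 = 0.065; three groups, alpha = .05, power = .95
pwr.anova.test(k = 3, f = 0.065, sig.level = 0.05, power = 0.95)
# -> roughly 1219 participants per group (about 3657 in total), as reported above

# Sensitivity analyses: which effect size is detectable with n = 270 per group?
pwr.anova.test(k = 3, n = 270, sig.level = 0.05, power = 0.95)  # f ~ 0.138 (d ~ 0.277)
pwr.anova.test(k = 3, n = 270, sig.level = 0.05, power = 0.80)  # f ~ 0.109 (d ~ 0.216)
```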
Sample
Like Agley et al. (2022), Huang et al. (2015), and Toich et al. (2021), we used the crowdsourcing service MTurk to recruit a self-selected sample of participants for this study. Participants were compensated with $1.82 for an estimated 15 min of survey time, based on the US federal minimum wage of $7.25/h. To be eligible, participants had to be at least 18 years of age and a resident of the United States. No other inclusion or exclusion criteria were applied on MTurk; participants with low numbers of completed and approved HITs were allowed to participate. Based on the power analysis, a sample of 810 people was targeted. Thus, the stopping rule was defined as follows: Data collection will stop when a sample of 810 non-duplicate responses is reached, which exhausts the research budget.
Instrument
Three procedural intervention indices and six post hoc statistical indices were used to detect careless responding. All indices are presented below.
Attention checks
To identify inattentive respondents, the use of attention check items such as instructed response items (IRI) and infrequency items (IF) is frequently recommended (Curran, 2016; Maniaci & Rogge, 2014; Ward & Meade, 2023). Three instructed response items were included in the survey. These items instruct participants to select a predefined response. The instructions in this survey were “Please leave this item blank”, “Please select 'Moderately Inaccurate' for this item”, and “Please select 'Moderately Accurate' for this item” (Brühlmann et al., 2020; DeSimone et al., 2015). For each participant, the number of incorrect responses was used for analysis. Three infrequency items, also referred to as bogus items, from Huang et al. (2015) were included in the survey. Such infrequency items ask participants about highly unlikely actions or behaviors. The infrequency items in this study were “I eat cement occasionally”, “I can teleport across time and space”, and “I have never used a computer” (Huang et al., 2015). Both the infrequency items and the instructed response items were randomly allocated to the six pages so that each page included one of these items. Participants were asked to indicate their level of agreement from very inaccurate (1) to very accurate (5). Responses of (4) “Moderately Accurate” or (5) “Very Accurate” were flagged as inattentive. The number of flags per participant was used for analysis.
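As an illustration of the flagging logic just described, the following R sketch counts per-participant flags on the instructed response and infrequency items. All column names (iri_1–iri_3, if_1–if_3) are hypothetical and the code is not the authors' script.

```r
# Hypothetical 1-5 coded columns: iri_1 ("leave blank"), iri_2 ("select
# Moderately Inaccurate" = 2), iri_3 ("select Moderately Accurate" = 4),
# and infrequency items if_1 to if_3.
count_iri_flags <- function(df) {
  (!is.na(df$iri_1)) +                      # any response where none was instructed
    (is.na(df$iri_2) | df$iri_2 != 2) +     # wrong or missing response
    (is.na(df$iri_3) | df$iri_3 != 4)
}

count_if_flags <- function(df) {
  # infrequency items are flagged when rated (4) Moderately Accurate or (5) Very Accurate
  rowSums(df[, c("if_1", "if_2", "if_3")] >= 4, na.rm = TRUE)
}

# total number of flags per participant across the six attention check items
attention_flags <- count_iri_flags(df) + count_if_flags(df)
```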
Self-report diligence
Nine self-report diligence items (Meade & Craig, 2012) were included at the end of the survey. The series consisted of five true-keyed (e.g., “I carefully read every survey item”) and four false-keyed items (e.g., “I was dishonest on some items”). As with the infrequency items, participants were asked to rate the self-report diligence items from very inaccurate (1) to very accurate (5). After inverting the scores on the false-keyed items, the average of these nine items was used for analysis. Reliability was high with Cronbach’s α = .84, 95% CI [.82, .86], and McDonald's ω = .86, 95% CI [.85, .88].
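A short R sketch of this scoring, assuming hypothetical column names srd_1–srd_9 with items 6–9 as the false-keyed ones (the actual item order is not specified in the text):

```r
# Reverse-key the false-keyed items on the 1-5 scale and average all nine items.
score_diligence <- function(df, false_keyed = c("srd_6", "srd_7", "srd_8", "srd_9")) {
  items <- df[, paste0("srd_", 1:9)]
  items[false_keyed] <- 6 - items[false_keyed]   # invert 1-5 responses
  rowMeans(items, na.rm = TRUE)
}

# Reliability estimates comparable to those reported (alpha, omega) could be
# obtained, e.g., with psych::alpha(items) and psych::omega(items).
```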
Response time
Response time per page was recorded by default by the survey software used in this study. Response time can be a valuable measure of participant carelessness, but no agreed-upon cutoff exists (Curran, 2016). Therefore, we averaged the response times across the pages of the IPIP inventory for each participant for further analysis. To bring the distribution of this variable closer to a normal distribution, we used the log of response time for further statistical analyses.
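In code, this amounts to taking the log of each participant's mean page time; the sketch below assumes hypothetical columns rt_page_1–rt_page_6 holding seconds per page.

```r
# log-transformed mean response time across the six questionnaire pages
log_mean_rt <- log(rowMeans(df[, paste0("rt_page_", 1:6)], na.rm = TRUE))
```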
Mahalanobis distance
The Mahalanobis distance is a multivariate outlier statistic that “compares a respondent’s scores to the sample mean scores across all items within a survey” (DeSimone et al., 2015; Mahalanobis, 1936). It estimates the multivariate distance between the sample means of the survey items and a participant's responses to those items. The Mahalanobis distance therefore allows detecting careless response behavior by identifying multivariate outliers among participants' responses, under the assumption that participants responding with insufficient effort deviate strongly from typical response patterns (DeSimone et al., 2015). For each participant, the Mahalanobis distance across the 300 items of the IPIP inventory was calculated and used for analysis.
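A minimal R sketch of this computation with base R's mahalanobis(); 'ipip' is a hypothetical data frame holding the 300 item responses, and the careless package's mahad() offers an equivalent computation, though the text does not state which implementation was used.

```r
ipip <- df[, grep("^ipip_", names(df))]     # hypothetical item columns
md <- mahalanobis(
  x      = ipip,
  center = colMeans(ipip, na.rm = TRUE),
  cov    = cov(ipip, use = "pairwise.complete.obs")
)
# larger values of md indicate responses farther from the sample's typical pattern
```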
Longstring
One indication of low-quality data is a participant's repeated selection of the same response option, a pattern that can be quantified with the longstring index (DeSimone et al., 2015): long runs of consecutive identical responses may reflect careless responding. The longstring index is particularly useful in long surveys that cover a variety of multidimensional constructs or mix positively and negatively worded items (DeSimone et al., 2015; Huang & Wang, 2021). The average longstring for each participant was used for further analysis.
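One plausible reading of the average longstring is the per-page maximum run of identical responses, averaged over the six pages. The base-R sketch below follows that reading; 'pages' (a list of item columns per page, in the order presented to the participant) is hypothetical, and careless::longstring() is an alternative implementation.

```r
# length of the longest run of identical responses in a vector of answers
longest_run <- function(x) max(rle(as.numeric(x))$lengths)

# average the per-page longstring over the six pages for every participant
longstring_avg <- sapply(seq_len(nrow(df)), function(i) {
  mean(sapply(pages, function(cols) longest_run(unlist(df[i, cols]))))
})
```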
Even-odd consistency
This index was termed even-odd consistency by Meade and Craig (2012) and is also known as individual reliability (Jackson, 1976). Each unidimensional scale is divided into two halves using an even-odd split of its items, and half-scale scores are computed (Meade & Craig, 2012). A within-person correlation is then computed between the resulting even and odd half-scale scores.
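The sketch below illustrates this computation in base R; 'facets' (a list mapping each unidimensional IPIP-NEO facet to its item columns) is hypothetical, and careless::evenodd() provides a packaged implementation of the same index.

```r
even_odd <- apply(df[, unlist(facets)], 1, function(resp) {
  halves <- sapply(facets, function(cols) {
    idx <- seq_along(cols)
    c(odd  = mean(resp[cols[idx %% 2 == 1]], na.rm = TRUE),   # odd-numbered items
      even = mean(resp[cols[idx %% 2 == 0]], na.rm = TRUE))   # even-numbered items
  })
  # within-person correlation between the vectors of odd and even half-scale means
  cor(halves["odd", ], halves["even", ], use = "pairwise.complete.obs")
})
```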
Psychometric synonym/antonym
These post hoc methods were implemented following Meade and Craig (2012), meaning that a threshold of .60 was used to identify item pairs with positive correlations for the psychometric synonym index. This resulted in 352 item pairs. However, no item pairs with correlations below −.60 could be identified for the psychometric antonym index. Instead, we had to deviate from the preregistration and set the threshold at −.144, resulting in 30 item pairs for the analysis, which is the same number of item pairs used by Johnson (2005).
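A base-R sketch of the psychometric synonym index follows; the antonym index uses the same logic with the negative threshold of −.144 described above (i.e., pairs with item_cors < -.144). careless::psychsyn() and psychant() are packaged alternatives; whether they were used here is not stated.

```r
# identify item pairs whose sample correlation exceeds .60 (each pair counted once)
item_cors <- cor(ipip, use = "pairwise.complete.obs")
item_cors[lower.tri(item_cors, diag = TRUE)] <- NA
pairs <- which(item_cors > .60, arr.ind = TRUE)   # 352 pairs reported above

# within-person correlation across the identified item pairs
psychsyn <- apply(ipip, 1, function(resp) {
  cor(resp[pairs[, "row"]], resp[pairs[, "col"]], use = "pairwise.complete.obs")
})
```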
Individual response variability
Dunn et al. (2018) introduced the individual response variability index as a more robust extension of the longstring index. The index is calculated as the within-person standard deviation of a participant's responses across item sets measuring different constructs, computed before reverse-coded items are rescored (Dunn et al., 2018).
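As a sketch, the overall index is simply the row-wise standard deviation of the raw (not yet reverse-scored) item responses; 'ipip_raw' is a hypothetical object, and careless::irv() implements the index, including per-split variants.

```r
# within-person standard deviation of the 300 raw item responses
irv <- apply(ipip_raw, 1, sd, na.rm = TRUE)
```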
Design
This study was implemented on the online survey platform EFS Unipark. Participants provided informed consent on the first page of the survey; next, they were asked about their age and gender. After the initial demographic questions, participants were randomly allocated to one of the three experimental conditions (see Table 2). In the active warning condition, we asked participants to copy the exact statement into a text area below it. We included the statement as an image to prevent simple copy and paste; thus, participants could not select the text and had to type in the statement. A custom script checked whether the provided text matched the intended statement. We allowed a small margin of error, such as incorrect punctuation or minor spelling mistakes. Once participants had typed at least 98% of the statement correctly, the “continue” button became available. The intention was that participants had to actively engage with the warning statement, strengthening the experimental manipulation (i.e., reducing inattention by increasing the salience of consequences). In the passive warning condition, we informed participants that mechanisms were in place to detect carelessness. For both warning conditions, we used the same statement as Toich et al. (2021), for two reasons: First, the statement warns of comparatively severe consequences and is worded more directly and strongly than in other works (e.g., Berinsky et al., 2016; Brink et al., 2019); we thereby aimed to maximize the potential deterrence effect of warning statements. Second, the statement is more appropriate in the context of crowdsourcing than other wordings, such as those in Meade and Craig (2012) or Huang et al. (2015). In the control condition, we simply stated that the survey begins on the next page.
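The matching rule of the active warning condition can be illustrated as follows. The authors' check ran as a custom script inside the survey platform; this R sketch only mirrors the described 98% criterion using a normalized edit distance (base R's adist()), and the exact tolerance handling is an assumption.

```r
# share of the target statement reproduced correctly, based on edit distance
similarity <- function(typed, target) {
  d <- as.numeric(adist(tolower(trimws(typed)), tolower(trimws(target))))
  1 - d / nchar(target)
}

# the "continue" button is enabled once at least 98% of the statement matches
can_continue <- function(typed, target) similarity(typed, target) >= 0.98
```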
After displaying the different statements, we presented a total of 306 items over six pages to the participants. Three hundred items stem from the original International Personality Item Pool-NEO inventory (IPIP-NEO; Goldberg et al., 2006; Goldberg, 1999). To detect IER in this part of the survey, we used three instructed response items (Brühlmann et al., 2020; DeSimone et al., 2015) and three infrequency items (Huang et al., 2015). For each of the six pages, 50 IPIP-NEO items plus one attention check item (an instructed response or infrequency item) were selected randomly prior to the study, so that each page consisted of 51 items in total. These items were rated on a five-point Likert scale from very inaccurate (1) to very accurate (5). The items within each page and the order of the six pages were randomized. Due to technical limitations of the survey platform, items were not randomized across different pages. The order of the items and pages presented was saved for each person; this order was used to sort the items for the longstring analysis. Upon completion of this section of the study, we presented participants with nine self-reported diligence items (Meade & Craig, 2012). Lastly, we provided the opportunity to give feedback and then presented a completion code for MTurk. The survey was designed to prevent participants from skipping questions (apart from the instructed response items). The participant flow of the study is depicted in Fig. 1.
Analytic strategy
By convention, we set the significance level to α = .05. Hypotheses 1a and 1b were tested using a negative binomial regression model. For all other hypothesis tests, we ran one-way ANOVAs with planned contrasts for the hypothesis variants a and b. For none of the dependent variables were all ANOVA assumptions (i.e., no severe outliers, normally distributed residuals, homogeneity of variances) met. Therefore, we decided to confirm all results using Kruskal–Wallis tests with Dunn's test of multiple comparisons for the planned contrasts. When the results of the nonparametric tests differed from those of the parametric tests, both are reported in the manuscript. Analyses were performed with R version 4.2.2 (R Core Team, 2022).
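The analysis pipeline described above can be sketched as follows. Object names ('dat', 'flags', 'dv') are hypothetical, and the packages used for the negative binomial model and Dunn's test (MASS and FSA here) are assumptions; the manuscript only states that R 4.2.2 was used.

```r
library(MASS)   # glm.nb for the negative binomial model
library(FSA)    # dunnTest for Dunn's multiple comparisons

dat$condition <- factor(dat$condition, levels = c("control", "passive", "active"))

# H1a/H1b: number of failed attention checks (count outcome)
nb_fit <- glm.nb(flags ~ condition, data = dat)
summary(nb_fit)

# H2a-H9b: one-way ANOVA; with treatment coding and "control" as the reference
# level, the coefficients correspond to the planned contrasts
# active-vs-control and passive-vs-control
aov_fit <- aov(dv ~ condition, data = dat)
summary(aov_fit)       # omnibus F test
summary.lm(aov_fit)    # contrasts of each warning condition against control

# Nonparametric confirmation
kruskal.test(dv ~ condition, data = dat)
dunnTest(dv ~ condition, data = dat)
```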
Results
Of a total of 913 complete responses, 97 were identified as duplicates and four were not from the United States. These responses were removed, leaving a final sample of 812 participants. Participants were unevenly distributed across the three conditions (χ2 = 9.668, df = 2, p = .008): there were 289 in the control condition, 294 in the passive warning condition, and 229 in the active warning condition. Participants abandoned the study more frequently when they saw that they were required to write down a text in the active warning condition. Participants in the active warning condition spent more time writing down the text (M = 5.06 min, Mdn = 3.35 min, SD = 6.04 min) than participants in the passive condition spent reading the warning statement (M = 15.6 s, Mdn = 8 s, SD = 27.1 s) or participants in the control condition spent on the corresponding page (M = 7.99 s, Mdn = 4 s, SD = 27.5 s). The majority of participants were women (n = 421, 51.8%; n = 3, 0.3%, were non-binary or self-described), and participants were on average 34.8 years old (SD = 10.4). Participants spent more time completing the survey in the active warning condition (M = 19.8 min, Mdn = 16.9 min, SD = 10.5 min) than in the passive warning condition (M = 15.8 min, Mdn = 13.2 min, SD = 9.76 min) and the control condition (M = 15.3 min, Mdn = 12.8 min, SD = 8.93 min). The overall median completion time was 14 min and 14.52 s. In the following, we present the results in the order of the hypotheses in Table 1.
Attention check items (H1a and H1b)
Between 37.1% and 48% of the participants missed one or more of the six attention check items. Descriptively, participants in the active warning condition were more likely to spot the attention check items and respond correctly (62.9%, see Table 3). A negative binomial model was used to examine the relationship between condition and the number of flags on the attention check items. Overall, the variable condition was not a significant predictor of the number of flags, LR = 1.765, df = 2, p = .414. Using dummy coding for the variable condition, the predictor condition-active-warning was not significant, B = −0.1038, SEB = 0.1706, p = .543, expB = 0.901, 95% CI [0.645, 1.248]. Likewise, condition-passive-warning was not a significant predictor in the model, B = 0.1164, SEB = 0.1542, p = .450, expB = 1.123, 95% CI [0.831, 1.521]. Thus, neither hypothesis H1a, that the number of failed attention check items is lower in the active warning condition than in the control condition, nor H1b, that the number of failed attention check items is lower in the passive warning condition than in the control condition, was supported.
Additional procedural intervention indices (H2a to H3b)
Refer to the upper part of Table 4 for descriptive statistics on the procedural intervention indices self-reported diligence and page response time. Boxplots and distributions of these indices are depicted in Fig. 2.
Results were in favor of H2a (the active warning statement increased self-reported diligence) but not of H2b (the passive warning did not increase diligence in comparison to the control condition; see Table 5 and Fig. 3). However, no significant differences between the conditions were observed for page response time (H3a and H3b).
Post hoc statistical indices (H4a to H9b)
The lower part of Table 4 presents descriptive statistics for each of the six post hoc statistical indices longstring, Mahalanobis distance, even-odd consistency, psychometric synonyms, psychometric antonyms, and individual response variability. Distributions and boxplots of these indices are presented in Fig. 2.
Statistical analyses showed significant differences between the active warning and the control condition for psychometric antonyms and individual response variability; thus, hypotheses H8a and H9a were supported. Additionally, using parametric a priori contrasts, participants in the active warning condition also showed significantly higher even-odd consistency and psychometric synonym correlations than participants in the control condition. However, these differences were not statistically significant using Dunn's nonparametric test; therefore, only partial support for H6a and H7a was found. For the remaining hypotheses, neither the parametric a priori contrasts nor Dunn's nonparametric tests reached statistical significance. Refer to Table 5 for an overview of the statistical analyses and their results.
Summary of results
In summary, the passive warning condition (hypothesis variants b) did not show an improvement over the control condition on any of the indices. In the active warning condition, significantly more participants dropped out than in the other conditions once they realized that they were required to write a text. However, among those remaining, a statistically significant reduction in carelessness was observed for self-reported diligence, even-odd consistency, psychometric synonyms, psychometric antonyms, and individual response variability in comparison to the control condition (hypothesis variants a). Refer to Fig. 3 for an overview of the effect size differences between the warning conditions and the control condition.
Discussion
Survey data are increasingly collected online, and ensuring the quality of the collected data has become an ever-growing concern for researchers (Ward & Meade, 2023). Preventing carelessness before it occurs is preferable to detecting careless respondents and applying extensive data cleaning procedures. Careless responding is of particular concern with crowdsourced samples, as these participants are compensated monetarily for their responses and, through their experience with taking surveys, are often aware of commonly used carelessness detection methods. Warning respondents that their responses will be screened for quality and that they may not be compensated for low-quality responses is a commonly used and straightforward prevention method. Various works have shown that such warning statements prior to study participation can reduce carelessness (Berinsky et al., 2016; Breitsohl & Steidelmüller, 2018; Huang et al., 2012; Huang et al., 2015). In a recent review, Ward and Meade (2023) suggested that warning statements could be more effective than informational statements (e.g., about the important contribution to research) in reducing inattentive behavior. However, crowdsourced workers may also be deterred by these warning statements because of their previous experience. Additionally, other studies have provided no evidence that such warnings work (e.g., Meade & Craig, 2012). Thus, there is a need to investigate, with preregistered hypotheses and high statistical power, whether such warnings effectively reduce all or some forms of careless responding.
In the present study, we aimed to fill this gap by studying the effect of warning statements on careless responding in a sample recruited on MTurk. However, we could not replicate the preventive effect of presenting warning statements on the three procedural intervention indices (i.e., self-reported diligence, page response time, and attention check items). Moreover, none of the six post hoc statistical indices (i.e., longstring, Mahalanobis distance, even-odd consistency, psychometric synonyms/antonyms, and individual response variability) showed statistically significant improvements for the passive warning condition (i.e., presenting a warning text) compared to the control condition. This finding is important because researchers often implement warnings about the potential loss of compensation in their surveys in the hope that these warnings will dissuade careless behavior (e.g., Agley et al., 2022). However, the results of our study show that such passively presented warnings are ineffective in reducing participant carelessness. A plausible explanation is that participants frequently skip such instructions and thus do not process the information presented there. It may also be that the non-naïve workers on Mechanical Turk (cf. Chandler et al., 2014) have become accustomed to such warnings, which reduces their potential effectiveness. Thus, presenting a comparatively strongly worded warning had no preventive effect in a sample recruited on MTurk. To account for careless participants often not reading such instructions or statements (e.g., Oppenheimer et al., 2009), we required them to copy the warning into a text area in the active warning condition. With this procedure, we expected participants to process the statement more thoroughly and, consequently, expected to observe a more substantial effect of this warning on the carelessness indices.
In general, our results support these considerations. Copying down the warning statement into a text area had a positive impact on participant carelessness, specifically in reducing psychometrically inconsistent responding, increasing response variability, even-odd consistency, and self-reported diligence. In contrast, merely presenting a warning statement in the passive condition resulted in no statistically significant improvement. However, it is also important to note that the active warning condition resulted in statistically significantly higher participant attrition compared to the passive warning condition, suggesting that the positive impact on various carelessness indices might be because potentially careless participants were deterred by this method. In the following, we will discuss results for each carelessness index investigated in this study.
Attention check items have the advantage of relatively little ambiguity in scoring: for instance, the absence of the instructed response or the selection of an impossible or very unlikely response on an infrequency item is a strong indicator of careless responding (Meade & Craig, 2012; Ward & Meade, 2023). In our study, participants in the two warning conditions were not less likely to miss attention check items than participants in the control condition. Descriptively, however, more participants in the active warning condition were not flagged by any of these check items. Although workers on MTurk are experienced with attention check items (Hauser & Schwarz, 2016), between 37.1% and 48% of the participants in each condition were still flagged by at least one attention check item. This suggests that attention check items remain a helpful method for unambiguously detecting careless responding on MTurk. However, the warning statements were ineffective in increasing participants' attention to these items. The same participants who miss response instructions embedded in items may also skip or disregard warning statements.
We further assessed self-reported diligence, which was statistically significantly higher in the active warning condition than in the control condition (d = 0.18). This does not necessarily mean that these participants were answering more diligently, but it does point to the active warning condition potentially facilitating a heightened experience of diligence. A potential consequence of increased diligence in responding to the items of questionnaires is longer response time, since more in-depth processing of the questionnaire would require more time. However, we observed no statistically significant differences in page response time between the conditions.
The indices designed to identify respondents who are multivariate outliers (Mahalanobis distance) or who provide a high number of identical consecutive answers (longstring) did not differ significantly between the conditions. However, participants in the active warning condition responded more consistently to psychometric synonyms (d = 0.16) and antonyms (d = 0.25) and to the even and odd items (d = 0.17) of the IPIP-NEO than participants in the control condition. This shows that the active warning was successful in increasing the consistency of responses, both in relation to the whole sample (psychometric indices) and in relation to the content of the scales (even-odd index). Furthermore, participants in the active warning condition also showed greater response variability as measured by the individual response variability index (d = 0.21). However, this result should be interpreted with caution, because individual response variability can be affected similarly by overly consistent and by random responding. Thus, conceptually, greater response variability could also indicate more carelessness.
From our study, we do not yet know whether the observed positive effects in the active warning condition are due to the content of the warning statement or to the fact that participants were required to write down a statement at all. Writing down the statement in the active warning condition required more time and effort than merely reading a warning statement. Similar to the instructional manipulation checks in Study 2 of Oppenheimer et al. (2009), the active warning condition required active engagement and a correct response to proceed with the survey. The unequal distribution of participants across the three conditions shows that more participants abandoned the survey when they realized that they had to write down a statement. Thus, it may be that the active warning statement led potentially careless participants to abandon the study, that the warning increased the attention of the remaining participants, or both. This uncertainty is especially important given the sample used in this study. Experienced crowdworkers will have developed their own mental models of how to avoid carelessness detection and of which studies to avoid entirely because compensation might be withheld if they do not fulfill the researcher's requirements. As such, comparative research in alternative populations, for example student populations, is still necessary to understand the precise effects of active warning statements. In addition, future research should investigate this by controlling for active engagement when providing warning statements. Given the relatively small effect of the active warning statement, adding one or multiple required open questions might be equally effective in deterring and detecting careless responses, similar to Brühlmann et al. (2020).
Taken together, our results show that merely presenting a warning statement to participants on MTurk is not an effective carelessness prevention method. Increasing the engagement with the warning statement by requiring participants to copy down the statement into a text area led to several statistically significant improvements in terms of consistency and variability of responses. This implies that the current standard practice of implementing warning statements is ineffective and novel methods to prevent and deter carelessness, such as our active warning condition, are needed.
Limitations and future directions
The main limitation of this study is that the participants were recruited from MTurk. MTurk is a frequently used service in online research, and several works have studied carelessness with participants from this platform. However, alternative services may provide better quality responses (e.g., Prolific, 2022; see Peer et al., 2022 for a recent comparison).
Further, it is important to note that, in line with the power analysis, the present study may have been underpowered for detecting statistically significant differences between the warning and control conditions. The sample size allowed for detecting effect sizes of the magnitude d = 0.216 with a power of 1 − β = .80, so any smaller, but still practically important, effect may not have reached statistical significance in this study.
In addition, it is important to note that while even-odd consistency was calculated for groups of items belonging to the same facet of the IPIP-NEO, items measuring the same facet were unlikely to appear close together in the survey due to the randomization. This randomization procedure might have depressed the even-odd consistency scores. However, in its original version, the IPIP-NEO presents items in random order (Goldberg, 1999).
For studies run on MTurk, it is frequently recommended (e.g., Keith et al., 2017) that platform filter mechanisms (such as a minimum number of completed and approved HITs) be used; we deliberately decided against this to observe more careless responding. It may be that a potential positive effect of the passive warning statement was masked by the sheer number of low-quality responses in all three conditions. Thus, there is potential to replicate these findings regarding the ineffectiveness of passive warning statements and the potential of active warning statements with different quality-control settings on the platform.
Another limitation is the wording of the warning statement, which was identical to that in the study by Toich et al. (2021). Studies demonstrating a positive effect of a passive warning used different, arguably less strongly worded, statements. Future research should explore variants of warnings; however, if a statement as severe as the one used in this study is ineffective, it seems unlikely that alternative, less stern warning statements would work. Moreover, while the studied warning was appropriate for MTurk (e.g., similar to Agley et al., 2022), it may not be possible to use such a strongly worded warning, whether passively or actively, on other participant recruitment platforms because of their terms of service. Such a strongly worded warning might also lead to adverse effects, as described by Huang and Wang (2021). Thus, there is a research opportunity to identify informational or warning statements that work in various contexts.
There is potential to explore more sophisticated deterrence methods than warnings or informational statements in online research. Future research could explore different variants of required engagement to proceed with the study. For instance, it would be interesting to explore whether the content of the warning affected attentive responding or whether the mere act of requiring participants to write a text is enough to reduce careless responding. Further research might also explore other presentations of warnings, such as videos and audio messages, or novel deterrence methods that are currently underexplored.
Conclusion
In conclusion, the present study examined the effectiveness of warning statements in reducing careless responding in a sample recruited on MTurk. Results showed that while passive warning statements had no effect on a range of nine carelessness measures, active warning statements had a small but statistically significant effect on self-reported diligence and improved indices of psychometric consistency, even-odd consistency, and response variability. However, active warning statements did not significantly reduce the number of participants flagged by attention check items or affect page response time. These findings suggest that requiring participants to actively process warning statements may have a small to medium effect on reducing careless responding.
Data availability
All material and data used for this study are available on OSF under https://osf.io/d2vp4/.
Code availability
The analysis script used for this study is available on OSF under https://osf.io/d2vp4/.
References
Agley, J., Xiao, Y., Nolan, R., & Golzarri-Arroyo, L. (2022). Quality control questions on Amazon’s Mechanical Turk (MTurk): A randomized trial of impact on the USAUDIT, PHQ-9, and GAD-7. Behavior Research Methods, 54(2), 885–897. https://doi.org/10.3758/s13428-021-01665-8
Arthur, W., Hagen, E., & George, F. (2021). The lazy or dishonest respondent: Detection and prevention. Annual Review of Organizational Psychology and Organizational Behavior, 8(1), 105–137. https://doi.org/10.1146/annurev-orgpsych-012420-055324
Berinsky, A. J., Margolis, M. F., & Sances, M. W. (2016). Can we turn shirkers into workers? Journal of Experimental Social Psychology, 66, 20–28. https://doi.org/10.1016/j.jesp.2015.09.010
Bowling, N. A., Gibson, A. M., Houpt, J. W., & Brower, C. K. (2020). Will the questions ever end? Person-level increases in careless responding during questionnaire completion. Organizational Research Methods, 24(4), 718–738. https://doi.org/10.1177/1094428120947794
Breitsohl, H., & Steidelmüller, C. (2018). The impact of insufficient effort responding detection methods on substantive responses: Results from an experiment testing parameter invariance. Applied Psychology, 67(2), 284–308. https://doi.org/10.1111/apps.12121
Brink, W. D., Eaton, T. V., Grenier, J. H., & Reffett, A. (2019). Deterring unethical behavior in online labor markets. Journal of Business Ethics, 156, 71–88. https://doi.org/10.1007/s10551-017-3570-y
Brühlmann, F., Petralito, S., Aeschbach, L. F., & Opwis, K. (2020). The quality of data collected online: An investigation of careless responding in a crowdsourced sample. Methods in Psychology, 2, 100022. https://doi.org/10.1016/j.metip.2020.100022
Carrier, L. M., Cheever, N. A., Rosen, L. D., Benitez, S., & Chang, J. (2009). Multitasking across generations: Multitasking choices and difficulty ratings in three generations of Americans. Computers in Human Behavior, 25(2), 483–489. https://doi.org/10.1016/j.chb.2008.10.012
Chandler, J., Mueller, P., & Paolacci, G. (2014). Nonnaïveté among Amazon Mechanical Turk workers: Consequences and solutions for behavioral researchers. Behavior Research Methods, 46(1), 112–130. https://doi.org/10.3758/s13428-013-0365-7
Curran, P. G. (2016). Methods for the detection of carelessly invalid responses in survey data. Journal of Experimental Social Psychology, 66, 4–19. https://doi.org/10.1016/j.jesp.2015.07.006
DeSimone, J. A., Harms, P. D., & DeSimone, A. J. (2015). Best practice recommendations for data screening. Journal of Organizational Behavior, 36(2), 171–181. https://doi.org/10.1002/job.1962
Dunn, A. M., Heggestad, E. D., Shanock, L. R., & Theilgard, N. (2018). Intra-individual response variability as an indicator of insufficient effort responding: Comparison to other indicators and relationships with individual differences. Journal of Business and Psychology, 33(1), 105–121. https://doi.org/10.1007/s10869-016-9479-0
Faul, F., Erdfelder, E., Buchner, A., & Lang, A.-G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41(4), 1149–1160. https://doi.org/10.3758/BRM.41.4.1149
Goldberg, L. R. (1999). A broad-bandwidth, public domain, personality inventory measuring the lower-level facets of several five-factor models. Personality Psychology in Europe, 7(1), 7–28.
Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., & Gough, H. G. (2006). The international personality item pool and the future of public-domain personality measures. Journal of Research in Personality, 40(1), 84–96. https://doi.org/10.1016/j.jrp.2005.08.007
Hauser, D. J., & Schwarz, N. (2016). Attentive Turkers: MTurk participants perform better on online attention checks than do subject pool participants. Behavior Research Methods, 48(1), 400–407. https://doi.org/10.3758/s13428-015-0578-z
Huang, J. L., & Wang, Z. (2021). Careless responding and insufficient effort responding. In Oxford Research Encyclopedia of Business and Management. Oxford University Press. https://doi.org/10.1093/acrefore/9780190224851.013.303
Huang, J. L., Curran, P. G., Keeney, J., Poposki, E. M., & DeShon, R. P. (2012). Detecting and deterring insufficient effort responding to surveys. Journal of Business and Psychology, 27(1), 99–114. https://doi.org/10.1007/s10869-011-9231-8
Huang, J. L., Bowling, N. A., Liu, M., & Li, Y. (2015). Detecting insufficient effort responding with an infrequency scale: Evaluating validity and participant reactions. Journal of Business and Psychology, 30(2), 299–311. https://doi.org/10.1007/s10869-014-9357-6
Jackson, D. N. (1976). The appraisal of personal reliability. Meetings of the Society of Multivariate Experimental Psychology, University Park.
Johnson, J. A. (2005). Ascertaining the validity of individual protocols from Web-based personality inventories. Journal of Research in Personality, 39(1), 103–129. https://doi.org/10.1016/j.jrp.2004.09.009
Keith, M. G., Tay, L., & Harms, P. D. (2017). Systems perspective of MTurk for organizational research: Review and recommendations. Frontiers in Psychology, 8. https://doi.org/10.3389/fpsyg.2017.01359
Mahalanobis, P. C. (1936). On the generalised distance in statistics. Sankhya A, 80(Suppl 1), 1–7 (2018). https://doi.org/10.1007/s13171-019-00164-5
Maniaci, M. R., & Rogge, R. D. (2014). Caring about carelessness: Participant inattention and its effects on research. Journal of Research in Personality, 48, 61–83. https://doi.org/10.1016/j.jrp.2013.09.008
Meade, A. W., & Craig, S. B. (2012). Identifying careless responses in survey data. Psychological Methods, 17(3), 437–455. https://doi.org/10.1037/a0028085
Oppenheimer, D. M., Meyvis, T., & Davidenko, N. (2009). Instructional manipulation checks: Detecting satisficing to increase statistical power. Journal of Experimental Social Psychology, 45(4), 867–872. https://doi.org/10.1016/j.jesp.2009.03.009
Peer, E., Brandimarte, L., Samat, S., & Acquisti, A. (2017). Beyond the Turk: Alternative platforms for crowdsourcing behavioral research. Journal of Experimental Social Psychology, 70, 153–163. https://doi.org/10.1016/j.jesp.2017.01.006
Peer, E., Rothschild, D., Gordon, A., et al. (2022). Data quality of platforms and panels for online behavioral research. Behavior Research Methods, 54, 1643–1662. https://doi.org/10.3758/s13428-021-01694-3
Prolific. (2022, December 29). Prolific’s attention and comprehension check policy. https://researcher-help.prolific.co/hc/en-gb/articles/360009223553-Prolific-s-Attention-and-Comprehension-Check-Policy. Accessed 29 Dec 2022.
R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing. URL https://www.R-project.org/. Accessed 29 Dec 2022
Toich, M. J., Schutt, E., & Fisher, D. M. (2021). Do you get what you pay for? Preventing insufficient effort responding in MTurk and student samples. Applied Psychology. https://doi.org/10.1111/apps.12344
Ward, M. K., & Meade, A. W. (2023). Dealing with careless responding in survey data: Prevention, identification, and recommended best practices. Annual Review of Psychology, 74. https://doi.org/10.1146/annurev-psych-040422-045007
Zwarun, L., & Hall, A. (2014). What’s going on? Age, distraction, and multitasking during online survey taking. Computers in Human Behavior, 41, 236–244. https://doi.org/10.1016/j.chb.2014.09.041
Funding
Open access funding provided by University of Basel. This study was financially supported internally by the Center for General Psychology and Methodology of the University of Basel.
Author information
Contributions
F. Brühlmann generated the idea for the study. F. Brühlmann and Z. Memeti jointly programmed the study. F. Brühlmann wrote the analysis code, and S.A.C. Perrig verified the accuracy of those analyses. F. Brühlmann wrote the first draft of the manuscript, and all authors critically edited it. All authors approved the final submitted version of the manuscript.
Ethics declarations
Conflict of interest
The author(s) declare that there were no conflicts of interest with respect to the authorship or the publication of this article.
Ethical approval
The study protocol was approved by the institutional review board of the Faculty of Psychology of the University of Basel, Switzerland under the number 025-21-1.
Consent to participate
Informed consent was obtained from all individual participants included in the study.
Consent for publication
Consent concerning publication was obtained from all individual participants in the study.
Open Practice Statement
The data and materials for all experiments are available at https://osf.io/d2vp4/ and all hypotheses were preregistered on https://osf.io/jeurs/.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.