Amazon Mechanical Turk in Organizational Psychology: An Evaluation and Practical Recommendations
Amazon Mechanical Turk (MTurk) is an increasingly popular data source in the organizational psychology research community. This paper evaluates methodological concerns related to the use of MTurk and the potential threats they pose to validity inferences. Based on this evaluation, we provide a set of practical recommendations to strengthen validity inferences drawn from MTurk samples.
Although MTurk samples can overcome some important validity concerns, there are other limitations researchers must consider in light of their research objectives. Researchers should carefully evaluate the appropriateness and quality of MTurk samples based on the different issues we discuss in our evaluation.
There is not a one-size-fits-all answer to whether MTurk is appropriate for a research study. The answer depends on the research questions and the data collection and analytic procedures adopted. The quality of the data is not defined by the data source per se, but rather the decisions researchers make during the stages of study design, data collection, and data analysis.
The current paper extends the literature by evaluating MTurk more comprehensively than prior reviews. Past review papers focused primarily on internal and external validity, paying less attention to statistical conclusion and construct validity, which are equally important for making accurate inferences about research findings. This paper also provides a set of practical recommendations for addressing validity concerns when using MTurk.
Keywords: Amazon Mechanical Turk; MTurk; Validity; Best practices; Recommendations
Internet Freelancing—or eLancing—has become an increasingly popular work arrangement among Internet users worldwide (Aguinis and Lawal 2012, 2013). eLancing has been commended as “an ideal natural environment to conduct experiments because researchers are able to use real people and real tasks in a controlled environment” (Aguinis and Lawal 2012, p. 497), balancing experimental control with a naturalistic setting. eLancing also offers other important benefits to researchers, including ease of access, convenience, low costs, and high efficiency. However, eLancing also comes with its own set of challenges that, if improperly handled, can threaten the quality of a research study.
Amazon’s Mechanical Turk (MTurk) is one of the more widely used eLancing (or crowdsourcing) options in the behavioral sciences and, as we show below, it is increasingly used within the organizational psychology research community. At the same time, a substantial body of academic literature has emerged that scrutinizes issues related to the use of MTurk for research purposes (Bergvall-Kareborn and Howcroft 2015), including validity, reliability, data quality, and generalizability (e.g., Chandler et al. 2014; Paolacci and Chandler 2014; Rouse 2015). It is important to note that the issues being deliberated are not exclusively applicable to MTurk samples; they apply to any type of sample gathered for experimental, quasi-experimental, and nonexperimental (e.g., survey) studies. However, the increased use of MTurk’s platform for recruiting research participants has led researchers to examine more carefully the nature of their samples and to consider other methodological concerns. This is likely because MTurk has become a sizable crowd employment platform used not only by academic researchers but also by large corporations and start-up companies (Bergvall-Kareborn and Howcroft 2015).
Within the organizational psychology community, Landers and Behrend (2015) recently discussed modern-day convenience sampling methods, highlighting both the advantages and disadvantages of using online panels and crowdsourcing samples (e.g., MTurk), and calling for additional discussion and investigation into the merits and weaknesses of online convenience samples such as those gathered through MTurk. We view the recent stream of MTurk discussions as an opportunity to bring methodological concerns to the fore and highlight recommendations to overcome or minimize such concerns when conducting research with MTurk samples.
We first provide an overview of MTurk, including its advantages and disadvantages as discussed in previous papers. Second, we review how MTurk samples are being used in top industrial–organizational (IO) psychology journals, including the use (or lack thereof) of quality control procedures. Third, we present an evaluation of ten methodological concerns related to the use of MTurk and their threats to the different types of validity evidence in Shadish et al.’s (2002) validity typology. Multiple papers (e.g., Horton et al. 2011; Mason and Suri 2012; Paolacci et al. 2010) have suggested that MTurk can overcome internal and external validity concerns based on the availability of random assignment in experimental settings and the diversity (or representativeness) of the MTurk population. However, there are additional concerns and limitations related to MTurk that can undermine internal and external validity. Our evaluation also addresses issues related to statistical conclusion validity and construct validity, which have not been covered in previous MTurk reviews but are nonetheless important.
Table 1 A summary of methodological concerns, validity threats, and recommendations

1. Subject inattentiveness (threatens internal, statistical conclusion, and construct validity)
   - Detect and screen inattentive responses
   - Use attention check items fairly and offer second chances to MTurk Workers
2. Selection biases
   - Consider the extent to which self-selection may affect the validity of findings in light of research objectives
3. Demand characteristics
   - Actively monitor MTurk forums
   - Avoid cues signaling study aims and eligibility criteria
   - Measure participant motivation
4. Repeated participation
   - Employ steps including data screening and MTurk system and customized qualifications
5. Range restriction
   - Justify necessary qualification requirements in recruiting MTurk Workers
6. Consistency of treatment and study design implementation
   - Minimize inconsistencies in study implementation. If study features are designed to differ, incorporate those components into the final analyses
7. Extraneous factors (threatens internal, statistical conclusion, and construct validity)
   - Identify, measure, and include possible sources of extraneous factors in data analyses, especially those common to MTurk participation
   - Proactively instruct MTurk Workers to minimize extraneous factors
8. Sample representativeness and appropriateness
   - Ensure that the characteristics of the obtained sample are as close as possible to those of the targeted population
   - Understand the demographic characteristics of the MTurk participant pool and determine whether MTurk is an appropriate data source
9. Consistency between construct explication and study operations
   - Evaluate the appropriateness of MTurk samples in relation to the explication of measured constructs
10. Method bias
   - Measure and control for method effects arising from MTurk samples
Amazon Mechanical Turk: An Overview
MTurk has been used widely by social science researchers (Behrend et al. 2011) to recruit participants for experimental (e.g., Crump et al. 2013; Horton et al. 2011; Sprouse 2011) and observational research (e.g., Buhrmester et al. 2011). Created and administered by Amazon, MTurk is an online labor market that assists Requesters (those who recruit and pay Workers, e.g., researchers) with hiring and compensating Workers (those who complete work and get compensated) to complete a variety of tasks (e.g., transcription, surveys, tagging, and writing). Each Worker possesses system qualifications assigned by the MTurk system and customized qualifications assigned by Requesters. Requesters may use qualification requirements to determine which Workers can or cannot participate in a task. In MTurk, each task is labeled as a HIT, which stands for Human Intelligence Task; a HIT represents a single assignment that Workers can work on, submit a response to, and be rewarded for completing. By default, each Worker can only work on a HIT once. Requesters are given the option to approve or reject a HIT submitted by Workers, and Workers are compensated only if their HITs are approved.
Before publishing their HITs, Requesters can specify qualification requirements. An example of a system-assigned qualification is the Master Qualification. According to mturk.com, Master Workers are those who have demonstrated exceptional performance and high levels of accuracy while completing HITs for a variety of Requesters on the MTurk marketplace. Masters must maintain their performance level and pass MTurk’s regular statistical monitoring to keep their Master status. Although Masters are considered “better” or “preferred” Workers, there are issues (e.g., repeated participation) related to Master status that we discuss below. Other system qualifications include approval rating (the percentage of approved HITs), number of approved HITs, and location. Requesters may also create customized qualifications suited or specific to their research to prescreen Workers. Customized qualifications common to organizational researchers include employment status, income level, industry, gender, and personal experiences at work (e.g., being a victim of workplace harassment). Requesters can create their own qualifications through several channels, including MTurk’s Web interface, the Command Line Tools, and the Amazon Web Services API. For more information on qualification assignment, readers are encouraged to consult Chandler et al. (2014). For more information on the process of becoming a Requester, readers can refer to Amazon’s basic user guide for Requesters (Welcome to Requester Help, n.d.).
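For readers taking the API route, qualification requirements are expressed as a list of requirement objects attached to a HIT at creation time. The Python sketch below builds such a list restricting a HIT to U.S.-based Workers with at least a 95 % approval rating; the QualificationTypeId values are placeholders, as the actual system qualification IDs are listed in the MTurk API reference.

```python
# Sketch of a QualificationRequirements structure for the MTurk
# CreateHIT operation. The QualificationTypeId values are
# placeholders; look up the real system qualification IDs in the
# MTurk API reference before use.

LOCALE_QUAL_ID = "<system locale qualification id>"        # placeholder
APPROVAL_QUAL_ID = "<percent-assignments-approved id>"     # placeholder

def build_qualification_requirements(country="US", min_approval=95):
    """Restrict a HIT to Workers located in `country` who have an
    approval rating of at least `min_approval` percent."""
    return [
        {
            "QualificationTypeId": LOCALE_QUAL_ID,
            "Comparator": "EqualTo",
            "LocaleValues": [{"Country": country}],
        },
        {
            "QualificationTypeId": APPROVAL_QUAL_ID,
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [min_approval],
        },
    ]

reqs = build_qualification_requirements()
```

A Requester would pass a list of this shape as the QualificationRequirements argument when creating the HIT (for example, via boto3's `create_hit`).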
Studies have found that MTurk holds promise for conducting research in the social sciences. For instance, Buhrmester et al. (2011) found that data provided by MTurk participants had satisfactory psychometric properties comparable to characteristics of published studies. The ability to recruit from diverse backgrounds can also alleviate the concerns regarding the oversampling of participants from WEIRD (Western, Educated, Industrialized, Rich, and Democratic) backgrounds (Henrich et al. 2010; Landers and Behrend 2015). In addition, Horton et al. (2011) found that experiments conducted on MTurk were as valid (both internally and externally) as other kinds of experiments (i.e., laboratory and field experiments), while reducing researcher time, costs, and inconvenience.
However, some researchers have questioned the identity and motives of Workers because they are willing to complete MTurk tasks for small amounts of pay (e.g., Paolacci and Chandler 2014). Although Workers come from diverse populations and backgrounds, critics have argued that Workers are by no means representative of most populations of interest: they tend to be Internet users who self-select into the eLancing work environment, and systematic differences between Internet and non-Internet users are often overlooked.
According to other researchers, MTurk Workers, compared to the general population, tend to be younger, underemployed, more liberal, and less religious (Berinsky et al. 2012; Shapiro et al. 2013). Additionally, Berinsky et al. (2012) and Roulin (2015) found that Whites and Asians are overrepresented, whereas Blacks and Hispanics tend to be underrepresented, on MTurk as compared to the general U.S. population and the U.S. workforce. This lack of representativeness, however, is not limited to MTurk samples. It is also a common concern in other convenience sampling methods used in organizational psychology, including organizational samples, college student samples, and snowball samples (Landers and Behrend 2015). Such sampling issues highlight that MTurk, as with any approach using convenience sampling, should be evaluated relative to its appropriateness for answering specific research questions (Zhu et al. 2015).
The Use of MTurk in IO Psychology
To examine trends in how MTurk samples are used, we conducted a literature search within 20 top IO psychology journals, using the search terms “Mechanical Turk” and “MTurk.” The resulting sample consisted of 99 empirical papers using at least one MTurk sample (see supplemental materials). Even though MTurk was launched in 2005 (Barger et al. 2011), we did not identify any IO studies between 2005 and 2011. There has, however, been a steady increase in papers using MTurk samples since 2012 (rising from 7 papers in 2012 to 44 papers in the first half of 2015). Most papers (over 80 %) used MTurk participants in conjunction with other samples, including undergraduate students, working professionals, and graduates (e.g., working MBA students). About half (53.5 %) of the papers used experimental designs only, 29.3 % used nonexperimental survey designs only, and 9.1 % used both experimental and nonexperimental designs on MTurk.
We coded the 99 papers based on the reported use of a variety of sample inclusion criteria and quality control procedures. There are several stages during MTurk data collection where quality control procedures and/or sample inclusion criteria are necessary to increase data quality and sample representativeness (Mason and Suri 2012; Paolacci and Chandler 2014). For example, Huang et al. (2012, 2015a, b) found that the absence of insufficient effort responding (IER) detection can severely compromise validity, and IER is arguably more frequent among online crowdsourced samples. Almost one-third (31.3 %) of the papers we identified did not report any quality control or sample inclusion criteria. A broader Google Scholar search for MTurk papers conducted by Chandler et al. (2014) found comparable results, such that many researchers did not report quality control procedures such as participant exclusion criteria. Ran et al. (2015) also found that the reporting of IER prevention or detection is very rare in top-tier psychology and management journals. The most commonly reported sampling criteria and quality control procedures were location-based criteria (typically requiring participants to be from the United States, 35.4 %), post hoc removal of data points due to biases or data contamination (36.4 %), and sample characteristics criteria (e.g., employment status, employed hours, 27.3 %). Other reported criteria include the use of instructional manipulation check or attention items (21.2 %), the examination of completion time for data quality (4 %), and the use of MTurk system-based qualifications (e.g., approval ratings, 3 %).
Among the studies that did not describe inclusion criteria for study participation or quality control procedures, it is unclear whether the researchers simply neglected to report their sampling and quality control procedures or failed to implement any such procedures. Most of these papers simply cited previous MTurk literature to justify the quality of MTurk samples and supported the appropriateness of MTurk based on the diverse demographic makeup of MTurk samples reported in previous papers (e.g., Buhrmester et al. 2011; Paolacci et al. 2010). Although citing these review papers helps support the use of MTurk, it is insufficient to justify the data collection procedures used in a particular study without also implementing appropriate quality control measures.
Some of these papers may have implemented quality control measures, but simply did not report them in their manuscripts. This may create a false impression that their MTurk samples are not of high quality or leave readers with unanswered questions about the nature of the data. Thus, we echo Paolacci and Chandler’s (2014) recommendations for researchers seeking to publish studies using MTurk samples to be very thoughtful about their study procedures and be transparent about their data collection and screening procedures.
An Evaluation and Practical Recommendations for the Use of MTurk
Although no method can guarantee the validity of an inference, research design and sample choices can have important consequences for validity (Shadish et al. 2002). Shadish et al. (2002) developed a validity typology to guide researchers’ evaluation of a variety of research designs and methods, comprising (1) internal validity, (2) statistical conclusion validity, (3) external validity, and (4) construct validity. We draw from this typology to focus on ten methodological concerns with MTurk samples. We discuss each concern in relation to the relevant types of validity evidence, highlighting how each issue can potentially threaten the validity of inferences made from MTurk samples. We also provide a set of practical recommendations for the use of MTurk based on our evaluation (see Table 1 for a summary).1
Subject Inattentiveness
Subject inattentiveness refers to the phenomenon in which respondents answer questions without fully attending to or complying with study instructions, accurately understanding item content, and/or providing accurate responses. Inattentive responding can be particularly problematic in online samples because experimental or survey administrations are often unproctored (Fleischer et al. 2015).
Threat to Internal Validity
Subject inattentiveness, and the related issue of IER (Huang et al. 2012, 2015b), represents an important threat to internal validity in MTurk samples: when MTurk Workers do not attend to the experimental stimuli or study instructions, manipulations and measurements may not work as intended. For example, if participants do not read the survey instructions before responding to the survey items, IER can introduce an additional source of extraneous variance that may be misinterpreted as part of the hypothesized effect or may confound the observed covariations (McGonagle et al. 2016). As we detail in our recommendations below, researchers collecting data from MTurk samples should take steps to monitor subject inattentiveness and detect IER.
Threat to Statistical Conclusion Validity
Statistical conclusion validity refers to the extent to which statistical inferences about the correlation (or covariation) between two variables are warranted. As with internal validity, subject inattentiveness can threaten statistical conclusion validity in studies of MTurk Workers and other similar online samples (Fleischer et al. 2015). When MTurk Workers do not pay attention to the items they respond to, the reliability and quality of the resulting responses can be substantially compromised. Past research has shown that IER can increase measurement error variance and thus attenuate or inflate observed correlations between variables (Huang et al. 2015a, b; Kam and Meyer 2015; McGrath et al. 2010).
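The attenuation case can be illustrated with a brief simulation (a hypothetical sketch, not a reanalysis of any cited study): replacing a fraction of attentive respondents with uniform random responders around the scale midpoint adds error variance and shrinks the observed correlation. Note that IER centered away from the sample mean can instead inflate correlations.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000

# Attentive respondents: x and y correlate about .50 in the population.
x = rng.normal(size=n)
y = 0.5 * x + np.sqrt(1 - 0.5**2) * rng.normal(size=n)

def contaminate(x, y, prop, rng):
    """Replace a proportion of respondents with random (inattentive)
    Likert-style answers, standardized to mean 0 and unit variance."""
    k = int(prop * len(x))
    xc, yc = x.copy(), y.copy()
    for arr in (xc, yc):
        raw = rng.integers(1, 6, size=k)      # uniform 1-5 responses
        arr[:k] = (raw - 3) / np.sqrt(2.0)    # Var of uniform{1..5} is 2
    return xc, yc

r_clean = np.corrcoef(x, y)[0, 1]
xc, yc = contaminate(x, y, 0.20, rng)         # 20 % careless responders
r_ier = np.corrcoef(xc, yc)[0, 1]
# With 20 % pure-noise respondents, the observed r shrinks toward ~.40.
```

The contaminated correlation falls below the clean one because the careless cases contribute variance but no covariance.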
Threat to Construct Validity
Subject inattentiveness can also affect construct validity. Construct validity refers to “the degree to which inferences are warranted from the observed persons, settings, and cause and effect operations included in a study to the constructs that these instances might represent” (Shadish et al. 2002, p. 38). It is especially problematic when scale development and validation efforts are based on data with substantial amounts of inattentive or careless responses (Meade and Craig 2012). It has been found that in samples with as few as 10 % careless responses, factor structures can become inaccurate, item correlations and model fit indices can be negatively impacted, and subsequent conclusions made about the measured constructs become unreliable (Fleischer et al. 2015; Meade and Craig 2012; Schmitt and Stults 1985; Woods 2006).
Recommendations for Screening for Subject Inattentiveness
MTurk respondents may differ in the levels of effort and attention exerted in their tasks. Recent research by Hauser and Schwarz (2016) found that MTurk Workers tend to be more attentive than traditional subject pool samples, but the detection of IER or inattentiveness is still important in ensuring data quality (Aust et al. 2013; Oppenheimer et al. 2009; Ran et al. 2015). Attention check questions (ACQs) are particularly important for data quality if researchers choose not to use MTurk Workers’ approval ratings as a participation criterion (Peer et al. 2014).
There are multiple steps researchers can take to detect and screen out inattentive/careless responding. For example, Meade and Craig (2012) recommended incorporating instructed response items, computing response consistency indices (e.g., response patterns), and conducting multivariate outlier analyses. Huang et al. (2012, 2015a, b) recommended using response time, infrequency items (i.e., attentive participants should provide the same responses to items with improbable factual statements), psychometric antonyms (i.e., testing inconsistent responses across dissimilar items), and individual reliability estimates to detect IER. Similarly, DeSimone, Harms, and DeSimone (2015) recommended using direct (e.g., instructed or bogus items), archival (e.g., response time, long string), and statistical (e.g., psychometric antonyms) screening techniques, noting that multiple techniques should be used because each detects different problems related to subject inattentiveness.
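To make these screens concrete, the sketch below implements three of them in Python: an instructed-response item, a longest-string index, and a response-time cutoff. The data format, item names, and cutoff values are hypothetical and should be calibrated to the instrument in use.

```python
# Illustrative IER screens (not a validated instrument). Each record
# is a dict with an instructed-response item, the substantive item
# responses, and the completion time in seconds.

def longest_string(responses):
    """Length of the longest run of identical consecutive answers."""
    best = run = 1
    for prev, cur in zip(responses, responses[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

def flag_respondent(record,
                    instructed_item="check_1",   # e.g., "Select 'Agree'"
                    expected_answer=4,
                    max_run=10,
                    min_seconds=120):
    """Return the list of screens this respondent failed."""
    flags = []
    if record[instructed_item] != expected_answer:
        flags.append("instructed_response")
    if longest_string(record["items"]) >= max_run:
        flags.append("long_string")
    if record["seconds"] < min_seconds:
        flags.append("too_fast")
    return flags

careless = {"check_1": 2, "items": [3] * 20, "seconds": 45}
attentive = {"check_1": 4, "items": [4, 2, 5, 3, 4, 1, 2, 4], "seconds": 300}
```

Because each screen targets a different behavior, flagging on the union of several screens (as DeSimone et al. 2015 advise) catches more forms of inattentiveness than any single index.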
Researchers implementing any of these data screening techniques should consider explaining to MTurk Workers in the HIT instructions/informed consent that response patterns will be monitored and any indications of random responding (without specifying the detection methods) will not result in compensation. Huang et al. (2015a) found that a benign warning provided to respondents about IER detection was effective at reducing IER without raising negative reactions. When responses are monitored, Workers are expected to pay closer attention to instructions, and in the event they fail a detection item, the researchers will have justifiable reasons to reject their work.
We also recommend that researchers offer second chances to Workers who fail data screening on their first attempt. Oppenheimer et al. (2009) found that after a prompt requesting that respondents pay closer attention to the instructions, the responding behaviors of those who initially failed the ACQs were indistinguishable from those of respondents who passed. Offering Workers a second chance may improve data quality while limiting data loss without risking selection biases (Aust et al. 2013). We encourage researchers to explain in their instructions or informed consent documents that each Worker is allowed a specific number of attempts.2 Clarity about the maximum number of attempts provides strong justification for refusing further attempts or rejecting a HIT, minimizes perceptions of unfairness, and protects Requesters’ reputation in the MTurk community.
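As an illustration, a simple bookkeeping routine (hypothetical, not part of the MTurk API) can enforce such a policy by counting failed attention checks per Worker and signaling when the stated maximum has been reached:

```python
# Hypothetical second-chance policy for attention-check failures:
# each WorkerId gets up to max_attempts before the HIT is rejected.
from collections import defaultdict

class SecondChancePolicy:
    def __init__(self, max_attempts=2):
        self.max_attempts = max_attempts
        self.attempts = defaultdict(int)   # WorkerId -> failed attempts

    def record_failure(self, worker_id):
        """Register a failed attention check and return the action."""
        self.attempts[worker_id] += 1
        if self.attempts[worker_id] < self.max_attempts:
            return "prompt_retry"   # ask the Worker to re-read instructions
        return "reject"             # documented grounds to reject the HIT

policy = SecondChancePolicy(max_attempts=2)
```

Announcing `max_attempts` up front in the HIT instructions is what gives the final "reject" decision its justification.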
Selection Biases
Selection biases occur when MTurk Workers (1) self-select into the MTurk Worker population and (2) self-select into a particular study. While the former applies uniquely to MTurk, the latter occurs in almost all research requiring human participants: regardless of the research design, participants must volunteer to take part in the study (Woo et al. 2015). Issues related to selection biases on MTurk can pose threats to construct and external validity.
Threat to Construct Validity
Selection biases in MTurk samples may raise construct validity concerns because the participants sampled in a study have a direct bearing on construct validity. Researchers must consider the fact that MTurk Workers self-selected into the MTurk population. That is, even a randomly selected MTurk sample would still carry the possibility of selection biases by virtue of who participates in MTurk and who opts for particular HITs. In particular, self-selection can threaten the extent to which participant characteristics correspond to the population to which inferences are made. For example, construct validity may become questionable if a study purports to examine the retirement intentions of older employees but relies on MTurk Workers, who are predominantly younger. In other words, there may be a lack of correspondence between the operational definitions used in the study and the measured constructs about which researchers draw inferences.
There are several distinct characteristics of MTurk Workers that may lead to self-selection biases, including irregular employment status, interests in monetary incentives, and inherent enjoyment of participating in HITs (Ipeirotis 2010). Depending on the study purposes, these factors may or may not weaken the validity of findings. The larger point that we will reiterate throughout our review is that the extent to which validity can be threatened by MTurk depends heavily on the research questions being investigated. There is not a “one-size-fits-all” answer to the appropriateness of MTurk; rather, researchers need to critically evaluate whether MTurk is appropriate for their research objectives.
Threat to External Validity
MTurk is commonly praised for overcoming external validity concerns across different study designs, including experimental, quasi-experimental, and nonexperimental (e.g., survey) designs. Unlike other frequently used samples (e.g., undergraduates and employees from the same organization/occupation), MTurk allows researchers to overcome some generalizability problems by gaining easy access to heterogeneous populations (i.e., with greater occupational and demographic diversity). MTurk samples can be particularly suitable for researchers who seek to understand work outside of WEIRD and traditional organizational contexts (e.g., humanitarian work psychology, cross-cultural psychology; Woo et al. 2015) or in hard-to-reach employee populations (e.g., employees who are disabled, marginalized, victims of workplace harassment, or have low socioeconomic status; Smith et al. 2015). Additionally, Bergman and Jean (2016) discussed the need for increased attention to underrepresented workers in organizational psychology, and MTurk may help facilitate studies of certain groups of underrepresented workers (e.g., underemployed workers).
The ability to sample MTurk Workers from a large participant pool (more than 500,000 Workers from 190 countries according to mturk.com, although the active population has been estimated at roughly 7,300 Workers; Stewart et al. 2015) strengthens the external validity argument, but it does not resolve the fact that random sampling is highly unlikely on MTurk. Random sampling simplifies external validity inferences because observed relationships are expected to match those of any other random sample of the same size from the same population (Shadish et al. 2002). However, MTurk Workers self-select into the MTurk population and choose which HITs to complete based on their personal preferences. Respondent characteristics or personal preferences may thus confound the observed relationships, a key feature of selection bias (Shadish et al. 2002). Moreover, although the large MTurk participant pool with its international membership (mostly from the U.S. and India) is well suited to testing organizational theories expected to apply broadly across organizational contexts (Landers and Behrend 2015), it may not work for phenomena specific to certain industries or to non-English-speaking organizations.
We should also note that the generalizability of a research study is not limited to variations across persons, but also across settings, treatments, and outcomes (Shadish et al. 2002). Therefore, depending on the research objectives, MTurk’s ability to sample from a more diverse pool may not necessarily overcome all types of generalizability concerns.
Recommendations for Evaluating Selection Biases
MTurk Workers self-select into the MTurk participant pool and have the discretion to choose which HITs to complete. Based on their research questions, researchers should evaluate, before beginning MTurk data collection, the extent to which self-selection into MTurk may undermine the validity of their findings. For example, if a study involves constructs that are inherently reflected in the decision to sign up as an MTurk Worker, such as being tech-savvy, having access to computers, or having an interest in online surveys, then MTurk would not be recommended as a data source because the construct measurement or manipulation would be contaminated. However, if a study aims to examine psychological phenomena among a diverse population spanning geographies and industries, an experimental manipulation would more likely succeed with MTurk Workers than with employees from a single organization (Woo et al. 2015). MTurk Workers’ motivation to participate in a study can also affect experimental manipulations and survey responses. Researchers can include questions about participants’ motivation to participate and investigate whether such motivation affects survey responses and the resulting findings (e.g., motivation-related common method variance; McGonagle 2015).
Demand Characteristics
Demand characteristics, or what Shadish et al. (2002) refer to as experimenter expectancies, are a potential methodological concern because researchers may influence participants’ responses by conveying expectations about desirable (or correct) responses, and those expectations may become part of the measured constructs and subsequently confound findings.
Threat to Internal Validity
The manner in which experimenter expectancies are manifested differs between MTurk and other settings in which they might be a concern. In traditional research settings, face-to-face interactions between researchers and participants can lead participants to behave in line with their perceptions of the desired responses or behaviors. Although MTurk offers the advantage that demand characteristics are reduced by the absence of face-to-face interaction between experimenters and participants (Highhouse and Zhang 2015), measurement contamination may still occur because MTurk Workers can communicate in MTurk forums and learn the study purposes of particular HITs (Schmidt 2015).
Threat to Construct Validity
Shadish et al. (2002) suggested that demand characteristics can be minimized by limiting contact between researchers and participants; MTurk is advantageous in that experimenter demand effects are less likely than in laboratory or field settings (Highhouse and Zhang 2015). The anonymity afforded by MTurk also reduces evaluation apprehension on the part of MTurk Workers. However, demand characteristics can still be apparent and provide cues about expected behaviors when MTurk Workers must pass a number of qualification requirements before they can proceed to a HIT. MTurk Workers may be motivated to respond untruthfully, based on the demand characteristics they perceive, in order to qualify for a HIT or to avoid lowering their approval ratings.
Participant motivation can also lead to reactive self-report changes and conformity to experimenter expectancies. MTurk Workers tend to be more honest in reporting behaviors and to show weaker social desirability tendencies than in-person samples, and they are more comfortable disclosing personal feelings because of MTurk’s anonymous platform (Shapiro et al. 2013; Smith et al. 2015; Woo et al. 2015). However, socially desirable responding may still occur when participants’ payments and approval ratings are contingent on their behaviors (Antin and Shaw 2012). Participants’ motivation to earn money from HITs can thus contaminate the measured constructs and lead to problems such as changes in item quality (Fleischer et al. 2015).
Recommendations for Minimizing Demand Characteristics and Understanding Participant Motivation
As Schmidt (2015) stated, there is a vibrant online community of MTurk Workers (a unique MTurk feature that is uncommon to other sample sources). Multiple websites have been developed for Workers to rate Requesters, comment on them and their HITs, and communicate with Requesters and other Workers about individual HITs. Some examples of MTurk online communities include Turkopticon (http://turkopticon.ucsd.edu/), Turker Nation (http://www.turkernation.com/), and Reddit (http://www.reddit.com/r/mturk). Researchers are encouraged to actively monitor these websites when they collect data from Workers. In situations where deceptions or manipulations are involved, it is important to make sure that the study purposes are not revealed to other Workers; otherwise, contamination may occur and the integrity of findings may be compromised. Additionally, to the extent possible while adhering to principles of informed consent, researchers should avoid cues signaling to Workers ahead of time about the study purposes and desired participant characteristics (or eligibility criteria). That way, researchers can minimize demand characteristics and avoid Workers fabricating their identities to participate in a HIT.
Experimenter demand effects may vary depending on the nature of participant motivation. According to Podsakoff et al. (2012), motivational factors may cause biased responding. In order to fully understand the differential effects of their motives, researchers should measure Workers’ motives for participating in their studies (e.g., inherent enjoyment and monetary incentives). This allows researchers to better understand how Workers’ motivation might moderate the findings or change the study outcomes.
MTurk Workers are not limited in the number of HITs they can complete for each Requester. Repeated participation can occur especially if MTurk Workers are inclined to complete tasks published by their “favored” Requesters (Chandler et al. 2014). Repeated participation is a particularly prominent concern in online research due to the lack of face-to-face interaction between researchers and respondents. For instance, a two-wave study involving MTurk Workers who participated in the same set of experimental tasks at two points in time showed markedly smaller effect sizes in the second wave (Chandler et al. 2015). Additionally, a recent study of Internet panels identified four types of respondents, one of which is the professional respondent (Matthijsse et al. 2015). Even though Matthijsse et al. (2015) did not find substantial differences in data quality between professional and nonprofessional respondents, some demographic and motivational differences between the two groups may lead to inaccurate validity inferences.
Threat to Internal Validity
Repeated participation can cause problems with manipulations, especially given the potential for cross-experiment stimuli contamination, meaning that random assignment might not actually be completely random. For example, MTurk Workers who repeatedly participate in the HITs published by their favorite Requesters may have knowledge about the study purposes and the content of different experimental conditions. Evidence is mixed concerning the prevalence of habitual or repeated participation on MTurk. Berinsky et al. (2012) found that repeated survey-taking (i.e., multiple responses from a single IP address) is not a large problem on MTurk, whereas Harms and DeSimone (2015) highlighted that there are “professional” Turkers (e.g., those with the Master Qualification) who are active MTurk users, and that their representation across multiple MTurk samples can cause problems such as sample nonindependence. Specifically, MTurk Workers may specialize in particular types of HITs, and their experiences in these HITs may confound the observed/hypothesized effects. Although repeated participation can threaten the internal validity of many experiments, it is not clear how severely it affects survey research. Ultimately, researchers should ask: what is the base rate of repeated participation among MTurk Workers, how would MTurk users’ nonnaiveté affect researchers’ ability to answer their research questions, and are naïve MTurk users (i.e., nonrepeaters) or seasoned/experienced MTurk users more suitable for their research questions?
Threat to Construct Validity
Habitual or repeated participation can create treatment diffusion—a threat to construct validity—because participants may receive information about conditions to which they were not assigned (Shadish et al. 2002). For example, a lucrative HIT may prompt MTurk users to create multiple accounts (even though it is prohibited under Amazon’s user agreement) and one person may participate in two or more study conditions under the disguise of different MTurk Worker IDs. Alternatively, the study purposes may be discussed among MTurk Workers on MTurk forums, causing potential treatment diffusion. However, it should be noted that cross-experiment stimuli contamination is arguably less likely on MTurk than in studies of workers from one organization (Highhouse and Zhang 2015). Even though MTurk Workers can talk among themselves on the Internet, employees who work within the same organization are more likely to communicate about the ‘treatment’ they receive due to physical proximity and familiarity (Shadish et al. 2002).
Recommendations for Reducing Repeated Participation
We encourage researchers to monitor discussions about their HITs on MTurk forums. In addition to MTurk forums, there are application plug-ins that allow MTurk Workers to monitor the activity of their “favored” Requesters and thus increase the chances of repeated participation (Chandler et al. 2014). By default, a Worker can only complete a HIT once. Researchers can deploy multiple surveys within the same HIT to avoid duplicated Workers. If researchers intend to combine multiple related HITs and consider them as one, they must take steps to ensure that all Workers are unique and that certain Workers are not overrepresented in their samples (i.e., sample nonindependence); identifying information such as MTurk Worker IDs and IP addresses can be used for this purpose. Note that duplicated IP addresses are possible when two different persons from the same household complete the same HIT, so researchers should examine participant demographic characteristics before removing such data.
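As an illustration of this screening step, the short Python sketch below flags repeated Worker IDs and shared IP addresses in a pooled dataset. The field names are assumptions for illustration, not part of any MTurk export format; duplicate IPs are only flagged for review, consistent with the caution above about shared household addresses.

```python
def flag_duplicates(responses, id_key="worker_id", ip_key="ip_address"):
    """Flag responses sharing a Worker ID or IP address across combined HITs.

    A duplicate Worker ID is a near-certain repeat participant; a duplicate
    IP address is only a candidate for review, since two persons in the same
    household can share one address.
    """
    seen_ids, seen_ips = set(), set()
    repeat_ids, review_ips = [], []
    for r in responses:
        if r[id_key] in seen_ids:
            repeat_ids.append(r)      # same Worker across the merged HITs
        elif r[ip_key] in seen_ips:
            review_ips.append(r)      # inspect demographics before removal
        seen_ids.add(r[id_key])
        seen_ips.add(r[ip_key])
    return repeat_ids, review_ips

# Hypothetical records pooled from two related HITs
data = [
    {"worker_id": "A1", "ip_address": "10.0.0.1"},
    {"worker_id": "A2", "ip_address": "10.0.0.1"},  # shared household IP
    {"worker_id": "A1", "ip_address": "10.0.0.9"},  # repeated Worker
]
repeats, review = flag_duplicates(data)
```

Only the repeated Worker ID warrants automatic exclusion; the shared-IP record would be retained pending a demographic check.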
Nonnaïve or “professional” Workers who have completed a large number of HITs might be preferred in some instances (e.g., Master Workers), but researchers then risk such Workers having foreknowledge of the study purposes or of the presence of attention check items, as well as the occurrence of treatment diffusion. It is difficult to completely rule out experienced Workers from participating in a study, but there are steps researchers can take to prevent their participation. Researchers who wish to recruit naïve MTurk Workers can establish qualification criteria that exclude more experienced Workers. For example, they can use system qualifications such as the number of HITs completed (e.g., fewer than 10 HITs). Customized qualifications assigned by Requesters can be created according to the researchers’ needs. For instance, Requesters can use MTurk’s web interface or command line tools to exclude Workers who have completed a previous study, Workers who have completed related studies, or Workers who belong to a certain demographic group by assigning a certain qualification value, so that these Workers cannot be granted HIT access.
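The qualification-based screening described above can be sketched programmatically. The following is a minimal sketch, assuming the AWS boto3 MTurk client; the system qualification ID for “Number of HITs Approved,” the 10-HIT threshold, and the customized exclusion qualification are illustrative assumptions that should be verified against current MTurk documentation before use.

```python
# System qualification type ID for "Number of HITs Approved" as documented
# by MTurk at the time of writing; verify before relying on it.
NUMBER_HITS_APPROVED = "00000000000000000040"

def naive_worker_requirements(max_hits=10, exclusion_qual_id=None):
    """Build a QualificationRequirements list restricting a HIT to naive Workers.

    Workers with max_hits or more approved HITs cannot discover, preview,
    or accept the HIT; optionally, Workers holding a customized "already
    participated" qualification are excluded as well.
    """
    reqs = [{
        "QualificationTypeId": NUMBER_HITS_APPROVED,
        "Comparator": "LessThan",
        "IntegerValues": [max_hits],
        "ActionsGuarded": "DiscoverPreviewAndAccept",
    }]
    if exclusion_qual_id:  # customized qualification assigned to past participants
        reqs.append({
            "QualificationTypeId": exclusion_qual_id,
            "Comparator": "DoesNotExist",
            "ActionsGuarded": "DiscoverPreviewAndAccept",
        })
    return reqs

# Usage (not run here): pass the list when publishing the HIT, e.g.
# import boto3
# mturk = boto3.client("mturk", region_name="us-east-1")
# mturk.create_hit(..., QualificationRequirements=naive_worker_requirements())
```

Guarding discovery and preview (rather than only acceptance) also helps avoid signaling the eligibility criteria to excluded Workers, consistent with the earlier recommendation to minimize demand cues.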
Requesters can also maintain a pool of MTurk participants who meet their research criteria to either sample them for a future study or exclude them for having completed similar studies (Chandler et al. 2014). Chandler et al. (2014) also recommended the sharing of customized qualifications among a group of researchers with similar interests in a specific population, so that sample nonindependence can be minimized when they pool their samples and findings together. However, the importance of sample independence may depend on the research objectives and nonindependence may not always be problematic. For example, researchers considering repeated-measures study designs and examining within- and between-subject comparisons may benefit from within-person measurements, along with increased power and reduced effects of measurement error variance.
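A maintained participant pool of the kind Chandler et al. (2014) describe can be operationalized by assigning a customized qualification to past participants and then screening on it in later HITs. The sketch below assumes the boto3 MTurk client; the qualification type ID and Worker IDs are placeholders, and the sentinel value is arbitrary.

```python
def build_exclusion_calls(past_worker_ids, exclusion_qual_id):
    """Prepare argument sets for associate_qualification_with_worker that
    tag Workers who completed an earlier, related study."""
    return [{
        "QualificationTypeId": exclusion_qual_id,
        "WorkerId": wid,
        "IntegerValue": 1,          # any sentinel value works for exclusion
        "SendNotification": False,  # avoid signaling the eligibility criteria
    } for wid in past_worker_ids]

# Usage (not run here): tag each past participant, e.g.
# import boto3
# mturk = boto3.client("mturk", region_name="us-east-1")
# for kwargs in build_exclusion_calls(["AWORKER1", "AWORKER2"], "QUAL_ID"):
#     mturk.associate_qualification_with_worker(**kwargs)
```

Sharing such a qualification type ID among collaborating researchers, as Chandler et al. (2014) suggest, lets a group screen against a common pool of prior participants.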
The common characteristics of MTurk samples (e.g., younger, Internet users, lower to middle-income households) may in some cases produce range-restricted samples. Researchers may attempt to estimate parameters of an unrestricted employee population but only have data from a restricted population (Hunter et al. 2006). Range restriction can reduce statistical power, weaken relationships, and lead to inaccurate conclusions if left uncorrected.
Threat to Statistical Conclusion Validity
Range restriction is particularly problematic when researchers sample from a single organization to answer questions about the general working population (Roulin 2015). MTurk may be a better alternative in this case, particularly in comparison to organizations with a highly idiosyncratic workforce. Researchers can also overcome issues related to statistical power by using MTurk as a low-cost data source to obtain larger sample sizes with more diverse sets of participants. In addition, the anonymity afforded by MTurk tends to result in more honest responses from a more diverse set of respondents, especially toward personal questions such as health-related and sexual behaviors (Smith et al. 2015). However, we note that range restriction can still occur, especially if researchers screen MTurk participants on certain factors to sample more representatively from their targeted populations. For example, researchers studying the effects of job stress on older workers may choose to filter participants based on their age and may therefore restrict the ranges on other measured variables. This is an example of a trade-off where one validity type may be compromised to enhance another type of validity. Shadish et al. (2002) noted that researchers need not worry about controlling all of the validity threats, but rather should recognize the trade-offs and make justifiable decisions based on their priorities and research questions.
Recommendations for Evaluating Possible Range Restriction
Although range restriction is less likely among MTurk Workers, it can still occur if researchers use qualification requirements to screen out some Workers. Researchers should carefully consider their qualification requirements and ensure that they are crucial for their research questions. For example, researchers interested in studying experiences of older employees might find it important to recruit based on Workers’ age, but not necessarily based on Workers’ years of working experience. Therefore, to find a common ground in the trade-off between range restrictions and sample representativeness (an issue we discuss below), researchers must base their decisions on their research objectives.
Consistency of Treatment and Study Design Implementation
Shadish et al. (2002) noted that the unreliability or inconsistency of treatment implementation is problematic especially when the study design is meant to be implemented and interpreted in a standardized manner. This can be particularly problematic when researchers use samples from MTurk and other different sources to generate conclusions.
Threat to Statistical Conclusion Validity
We noted in our literature search above that many papers published in top IO journals used MTurk samples in conjunction with other types of samples (e.g., undergraduates and working professionals). Even though using samples from different sources may bolster researchers’ conclusions, it is sometimes unclear whether experimental manipulations or survey study designs were implemented in a consistent and reliable manner across the MTurk and non-MTurk studies. Without standardization and consistent implementation, using the different samples to make study conclusions can lead to misestimated effect sizes, and it would be difficult to attribute the effect sizes to different design features and/or constructs.
Recommendations for Ensuring Consistency of Treatment and Design Implementation
Skepticism about MTurk may prompt researchers to use other data sources, in addition to MTurk, to support their findings. Although this may overcome generalizability concerns, researchers must make sure that their study designs are implemented consistently from sample to sample. For example, if an online survey was deployed to an MTurk sample, the same online survey should be administered to an organizational sample. This obviously does not rule out environmental inconsistency given that Workers complete their HITs in different settings; however, researchers should make their best efforts in consistently administering their study. If a lack of standardization is inherent to some study designs (e.g., widely differed training provided to different units), researchers should measure the different study components and explore how they are related to changes in relationships and outcomes (Shadish et al. 2002).
The research settings afforded by MTurk through the Internet are different from traditional settings where researchers meet with participants face-to-face in the same physical environment (e.g., distributing surveys at an organization or conducting laboratory experiments). There are multiple factors that could contribute to extraneous variance in MTurk samples.
Threat to Internal Validity
In laboratory settings, many extraneous variables can be controlled for by putting participants in a uniform environmental setting, so that any systematic differences in the features of an environment will less likely contribute to errors in manipulations and measurements. Even though MTurk can facilitate randomized assignment, it cannot provide environmental uniformity given that MTurk Workers complete their HITs in different physical environments. Without understanding or measuring the sources of extraneous variance, internal validity of a study can be compromised.
Threat to Statistical Conclusion Validity
Inferences made about covariations and the strength of relationships can be erroneous if extraneous variables are not appropriately measured and controlled. Specifically, extraneous factors can introduce additional sources of variance that may be misinterpreted as part of the hypothesized/observed effects. For example, MTurk Workers may not be able to pay attention to the research study instructions due to the salience of environmental features (e.g., distracting noises); these features may add a systematic source of variance that researchers should take into consideration.
Threat to Construct Validity
The online MTurk platform cannot guarantee that the experimental settings theorized by researchers correspond with the empirical realization of the settings where MTurk Workers participate in the study. That is, construct validity would be questionable if extraneous factors (e.g., environmental distraction) introduce deviations from the settings and operational definitions assumed by the researchers.
Recommendations for Accounting for Extraneous Variables
Given that MTurk facilitates random assignments in settings where Workers are subject to different physical environmental influences, we encourage researchers to identify possible sources of noise, measure these extraneous variables during data collection, and include them in data analysis. These analyses can shed light on the extent to which extraneous factors change construct measurement, study relationships, and outcomes. Pilot studies can be conducted to identify these sources of extraneous factors. Extraneous variables common to online participation should be considered and controlled for across all MTurk samples, including their physical environment, browser experiences, environmental distraction, respondent interests, and motivation (Meade and Craig 2012). Researchers should also take proactive steps prior to data collection to minimize the effects of such factors. For example, they can specify in the instructions that participants must be in a quiet room when they complete the HITs or that participants must use a certain browser or software to complete the HITs.
Sample Representativeness and Appropriateness
The extent to which a sample is representative of a specific population or appropriate for the research objectives has implications for whether conclusions drawn from that sample apply to the population of interest. All types of samples can be evaluated for their representativeness and appropriateness, and MTurk samples are no exception.
Threat to External Validity
Sample representativeness is often discussed when evaluating the external validity of a research study. One of the criticisms of using online samples, such as those from MTurk, is that the identities of MTurk Workers are unknown. In addition, concerns about whether MTurk Workers represent the general population have been expressed in previous reviews, especially in light of the fact that MTurk Workers are Internet users and they may have systematic differences from non-Internet users (Paolacci and Chandler 2014). Although the diversity of MTurk Workers has been praised with regard to increasing external validity (e.g., Landers and Behrend 2015), certain demographic groups are over- or underrepresented on MTurk (e.g., age, education, and race). The suitability of MTurk samples would thus be best determined by the research questions researchers want answered.
Finally, we note that external validity evidence is not only limited to whether the samples are representative of or generalizable to a population, but also whether certain phenomena hold across settings. Single-organizational samples are limited partly because inferences are confounded with the fact that the employees went through the same hiring and selection procedures, orientations, training, socialization processes, etc. Researchers studying psychological phenomena expected to vary across settings may benefit from using MTurk because MTurk Workers are situated in a variety of settings. Researchers who are able to measure characteristics of their organizational settings may also be able to study contextual factors and their potential moderating effects on study relationships of interest.
Threat to Construct Validity
A lack of sample representativeness can also threaten construct validity. While external validity indicates the extent to which effects observed in one set of sampling particulars (e.g., persons, settings, treatments, and outcomes) are also observed in other sampling particulars, construct validity represents “the degree of correspondence between the constructs referenced by a researcher and their empirical realizations” (Stone-Romero 2011, p. 40).
Sample representativeness can threaten construct validity because it affects the extent to which a set of sampling particulars (e.g., participants, settings, treatments, and outcomes) correspond to the population to which researchers want to draw inference. For example, construct validity would be limited if researchers aim to study the behaviors of upper-level managers but their MTurk sample consists of only entry-level workers.
Recommendations for Maximizing Sample Representativeness and Determining Sample Appropriateness
Even though the diversity of MTurk Workers may benefit researchers from an external validity perspective, researchers should consider any possible trade-offs with statistical conclusion validity and/or construct validity. Specifically, while the heterogeneity of respondents creates greater variance on measures and may therefore affect the systematic covariation between variables, the homogeneity of respondents or treatment conditions may limit arguments for external validity. On the other hand, having participants from different (but relevant) populations may increase external validity, but it may threaten construct validity when they do not all strictly belong to the target population.
The nature of MTurk’s diverse participant pool needs to be understood as researchers decide whether to use MTurk and whether an MTurk sample would represent their targeted population. Since random sampling is not feasible with MTurk, researchers need to make their best efforts to ascertain that their sample characteristics closely resemble their population of interest. MTurk is unique in the way it creates system qualifications and allows Requesters to create customized qualifications based on desired sample characteristics (see Chandler et al. 2014). Researchers should recruit and select MTurk Workers by proactively utilizing both system and customized qualifications (e.g., age, location, gender, occupation, employment status) to increase the correspondence between their actual and desired sample characteristics. In administering qualification tests or questionnaires, researchers should avoid overt cues about the eligibility criteria that could influence participants’ responses. Researchers should also restrict Workers from attempting to take a qualification test more than a certain number of times, in order to prevent Workers from finding out the eligibility/inclusion criteria and subsequently giving “correct” but untruthful responses.
Researchers should also attempt to verify the desired characteristics of Workers and consistency in their responses. For example, if Workers are required to be full-time employees, researchers can verify their employment status (in addition to using qualification tests) by embedding questions that would only be answered affirmatively by someone who had the desired characteristics, such as their job title, work schedule, and salary. Inconsistent or implausible combinations of responses should be removed prior to any data analysis to avoid generalizability issues.
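One way to implement such consistency checks is sketched below. The field names, the 35-hour full-time criterion, and the verification items are illustrative assumptions; researchers would substitute whatever embedded questions fit their own eligibility criteria.

```python
def flag_inconsistent(respondents, min_fulltime_hours=35):
    """Flag respondents whose answers contradict claimed full-time status.

    A respondent claiming full-time employment is flagged when they report
    implausibly few weekly hours or cannot supply a job title.
    """
    flagged = []
    for r in respondents:
        claims_fulltime = r.get("employment_status") == "full-time"
        hours_ok = r.get("weekly_hours", 0) >= min_fulltime_hours
        has_title = bool(r.get("job_title", "").strip())
        if claims_fulltime and not (hours_ok and has_title):
            flagged.append(r["worker_id"])
    return flagged

# Hypothetical screening data with embedded verification items
sample = [
    {"worker_id": "A1", "employment_status": "full-time",
     "weekly_hours": 40, "job_title": "Analyst"},
    {"worker_id": "A2", "employment_status": "full-time",
     "weekly_hours": 10, "job_title": "Clerk"},   # implausible hours
]
flagged = flag_inconsistent(sample)
```

Flagged cases would then be removed prior to analysis, as recommended above, rather than reweighted or retained.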
Finally, MTurk’s participant pool is international, but it is by no means representative of the global workforce; MTurk Workers are predominantly U.S. citizens, Indians, and English speakers (Ipeirotis 2010). Therefore, MTurk samples may be most appropriate for testing theories or phenomena that are not expected to vary across cultures or that are specifically relevant to U.S. or Indian populations. An investigation of, for example, cross-cultural issues among non-English-speaking organizations or employees may not be feasible on MTurk. Moreover, researchers sampling from MTurk Workers may miss some types of employees with certain characteristics (e.g., white-collar professionals) and, in some instances, a large single-organizational sample might be more appropriate. In other instances, researchers should carefully consider the relevance of measured constructs to MTurk Workers, and whether they are industry specific. For example, personality effects on job performance outcomes may be studied using an MTurk sample representing different occupations/industries, but a study focusing on the effects of industry-specific knowledge on job performance may not be relevant to all MTurk Workers. Therefore, we urge researchers to carefully consider whether MTurk samples are suitable for answering their research questions about their targeted populations, and not let MTurk’s cost-efficient and convenient nature dominate their decisions to use MTurk.
Consistency Between Construct Explication and Study Operations
Construct validity is assessed largely based on the extent to which the properties of operational definitions in a study are consistent with the properties of theorized constructs. Discrepancies between the two not only affect construct validity, but also other validity types and conclusions made about the constructs based on discrepant study operations. This issue applies to all types of samples, but it is particularly important when MTurk researchers use system/customized qualifications to select participants.
Threat to Construct Validity
Inferences about construct validity are more strongly supported when the characteristics of a sample collected from MTurk match the desired characteristics of a sample defined in a construct. As illustrated in Shadish et al.’s (2002) example, discrepancies between construct and operations may occur when a researcher is interested in the construct of unemployed and disadvantaged workers, and he/she samples from families below the poverty level who may not necessarily be unemployed or disadvantaged. A mismatch as such would undermine inferences about construct validity and lead to inaccurate conclusions made about the measured constructs.
Recommendations for Ensuring Consistency Between Construct Explication and Study Operations
As noted, a mismatch between construct explication and study operations can be problematic. For instance, a mismatch may occur if researchers interested in the construct of transformational leadership obtain a sample of MTurk Workers who are predominantly low-wage earners, in which upper-level employees are underrepresented. MTurk Workers in this sample may be asked questions about executive leadership styles that are irrelevant to them, because they have not had the opportunity to directly observe executives’ behavior. In addition to creating inconsistencies between the sample and the construct explication, erroneous conclusions may be drawn about executives based on the perceptions of a low-wage worker sample. Such consistency between construct explication and study operations is important not only for persons, but also for settings, treatments, and outcomes. Researchers should thus evaluate the appropriateness of MTurk samples with careful consideration of the nature of the constructs they intend to measure.
Method bias is a commonly discussed methodological concern in the behavioral sciences (e.g., Spector 2006). A number of meta-analyses have indicated that the impact of method biases on item validity and reliability can contribute to inappropriate conclusions if not appropriately controlled for (Podsakoff et al. 2012). As with any other sample source, researchers using MTurk samples should consider the possibility of method biases in light of their research questions and targeted populations, and take steps to account for them in analyses.
Threat to Construct Validity
According to Shadish et al. (2002), mono-method bias is one of the method biases that can threaten the credibility of construct validity evidence. For example, using MTurk as a sole source of construct measurement may introduce problems with mono-method bias (also known as common method bias), a threat where the method (or measurement context) may become a part of the construct actually studied (Podsakoff et al. 2012; Shadish et al. 2002). MTurk is usually limited to one method in how treatments or surveys are presented to respondents given that Requesters can only distribute their HITs through the Web interface, and thus the measurement contexts are the same for all Workers. Another type of mono-method bias in MTurk samples is based on common rater effects, where the same respondents provide responses to both the predictor and criterion (Podsakoff et al. 2003). As of now, the MTurk platform does not have convenient and direct access to alternative rater sources, such as MTurk Workers’ supervisors, peers, or spouses.
It is important to note that mono-method bias is only one of many sources of method biases, but it is one that is particularly applicable to MTurk samples. Researchers should also consider the implications of other potential sources of method bias that are common to MTurk samples and other sample sources, such as item characteristic effects and item context effects (Podsakoff et al. 2003, 2012).
Recommendations for Examining Method Bias
The measurement context in MTurk is primarily the Web interface, and it may become a part of the measured constructs unless researchers separate out the “method” factor in their analyses. Researchers using MTurk samples should examine whether method factors emerge in their latent measurement models; if they do, appropriate measures should be taken to control for them (Podsakoff et al. 2012). To minimize method biases arising from measurement context effects, researchers may consider adopting time-lagged research designs, where predictors and criterion variables are measured at different points in time; alternatively, predictors and criterion variables can be administered in different mediums (e.g., Qualtrics vs. the MTurk interface). Other method biases such as common rater biases are harder to overcome on MTurk. MTurk does not currently have the capability to survey and match data from multiple rater sources (e.g., supervisors and coworkers). Requesters would have to administer the surveys to other raters through the MTurk Worker, which can cause a different set of problems (e.g., honesty and compensation issues). In this case, single-organizational samples would be easier to manage and more feasible for collecting data from multiple raters.
In this article, our evaluation of MTurk highlights ten methodological concerns in relation to their validity threats, and we offer a number of recommendations to minimize their threats to validity evidence. We also offer several future directions researchers should consider regarding the use of MTurk in organizational psychology. First, we encourage further investigation and documentation of how researcher decisions regarding sample inclusion criteria, quality control and data screening procedures in MTurk may influence study findings, and whether the reliability and validity of measures covary with these procedures. Echoing DeSimone et al.’s (2015) recommendation, we encourage researchers to report the results from before and after implementing data screening techniques to better understand their impact in research and highlight the importance of transparency.
Second, the prevalence and consequences of repeated participation on MTurk are unclear. Additional strategies may be necessary to identify instances of repeated participation and nonnaïve Workers. For example, are direct methods such as asking Workers about their past HIT experiences sufficient to identify nonnaïve Workers and understanding the impact of nonnaiveté on final results? Are there other steps that may be taken to prevent nonindependence of observations?
Third, the anonymity afforded by MTurk allows more honest responses and increased access to hard-to-reach employee populations (e.g., vision- or hearing-impaired, marginalized individuals; Smith et al. 2015), but it is unclear whether MTurk’s platform is accessible from these Workers’ perspectives. Some studies have used MTurk to collect speech data (e.g., Callison-Burch and Dredze 2010), suggesting the platform may be a suitable tool for vision-impaired MTurk Workers. Future research should examine how researchers can utilize the unique MTurk platform to collect data from and accommodate these individuals.
Lastly, MTurk’s participant pool continues to expand and its demographics are likely changing as well. Updates on the demographic characteristics of MTurk’s participant pool are needed so that researchers can more accurately determine the appropriateness of MTurk samples. In addition, information about the nature of organizational membership among MTurk Workers would be particularly valuable for organizational psychology researchers. We hope that these future directions will encourage researchers to continue probing into the intricacies of MTurk and further our conversation about the use of MTurk in organizational psychology scholarship.
MTurk is an increasingly popular data source within the organizational psychology research community. It offers some clear advantages to researchers in terms of its ability to generate large and diverse datasets both quickly and at relatively low cost. Despite these advantages, MTurk also has some weaknesses that highlight the importance of using it in a rigorous and thoughtful manner. In our evaluation of MTurk, we discussed several methodological issues researchers should consider in the context of Shadish et al.’s (2002) validity typology. We encourage researchers to consider these strengths and limitations so that they can carefully consider the appropriateness of MTurk samples and study design implications in answering their research questions.
Two consistent themes throughout our review are that the appropriateness of MTurk samples depends primarily on the research questions researchers want answered, and that MTurk data quality depends on the strategies researchers adopt to increase data quality. There may be trade-offs between different validity types, and researchers need to carefully prioritize them in light of their research objectives. We encourage researchers to engage in critical evaluations of MTurk as we have outlined above, and to follow the practical recommendations for the use of MTurk. Doing so should lead to better quality data from MTurk studies and reduce the likelihood that MTurk studies are dismissed solely on the basis of unwarranted assumptions about such samples.
The ten methodological concerns are not presented in any particular order that indicates the importance or prevalence of each concern.
In our own data collections, we have allowed MTurk participants a maximum of two attempts and have received positive reviews from participants about offering them a second chance.