Nonnaïveté among Amazon Mechanical Turk workers: Consequences and solutions for behavioral researchers

Abstract

Crowdsourcing services—particularly Amazon Mechanical Turk—have made it easy for behavioral scientists to recruit research participants. However, researchers have overlooked crucial differences between crowdsourcing and traditional recruitment methods that provide unique opportunities and challenges. We show that crowdsourced workers are likely to participate across multiple related experiments and that researchers are overzealous in the exclusion of research participants. We describe how both of these problems can be avoided using advanced interface features that also allow prescreening and longitudinal data collection. Using these techniques can minimize the effects of previously ignored drawbacks and expand the scope of crowdsourcing as a tool for psychological research.

Author Note

Jesse Chandler, Postdoctoral Research Associate, Woodrow Wilson School of Public Policy, Princeton University (jjchandl@umich.edu); Pam Mueller, Graduate Student, Department of Psychology, Princeton University (pamuelle@princeton.edu); Gabriele Paolacci, Assistant Professor, Department of Marketing Management, Rotterdam School of Management, Erasmus University (gpaolacci@rsm.nl).

Jesse Chandler is now at PRIME Research, Ann Arbor, MI and The Institute for Social Research, University of Michigan.

The authors wish to thank John Myles White for help developing and testing the API syntax and Elizabeth Ingriselli for her help coding data.

Correspondence concerning this article can be addressed to any of the authors.

Author information

Corresponding author

Correspondence to Jesse Chandler.

Electronic supplementary material

Below is the link to the electronic supplementary material.

ESM 1

(XLSX 42 kb)

About this article

Cite this article

Chandler, J., Mueller, P. & Paolacci, G. Nonnaïveté among Amazon Mechanical Turk workers: Consequences and solutions for behavioral researchers. Behav Res 46, 112–130 (2014). https://doi.org/10.3758/s13428-013-0365-7

Keywords

  • Crowdsourcing
  • Internet research
  • Data quality
  • Longitudinal research
  • Mechanical Turk
  • MTurk