Behavior Research Methods, Volume 46, Issue 1, pp 112–130

Nonnaïveté among Amazon Mechanical Turk workers: Consequences and solutions for behavioral researchers

Abstract

Crowdsourcing services—particularly Amazon Mechanical Turk—have made it easy for behavioral scientists to recruit research participants. However, researchers have overlooked crucial differences between crowdsourcing and traditional recruitment methods that provide unique opportunities and challenges. We show that crowdsourced workers are likely to participate across multiple related experiments and that researchers are overzealous in the exclusion of research participants. We describe how both of these problems can be avoided using advanced interface features that also allow prescreening and longitudinal data collection. Using these techniques can minimize the effects of previously ignored drawbacks and expand the scope of crowdsourcing as a tool for psychological research.
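The advanced interface features mentioned above center on MTurk's Qualification system, which lets requesters tag workers who completed an earlier study and screen them out of (or selectively recruit them into) later ones. Below is a minimal sketch of that workflow, assuming the boto3 MTurk client; the qualification name and worker IDs are hypothetical placeholders, not code from the paper.

```python
# Sketch: excluding prior participants from a new HIT via an MTurk
# Qualification, assuming boto3 credentials are already configured.
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# Create a Qualification marking workers who took a related earlier study.
qual = mturk.create_qualification_type(
    Name="Completed-Study-1",  # hypothetical name
    Description="Worker already participated in a related experiment",
    QualificationTypeStatus="Active",
)
qual_id = qual["QualificationType"]["QualificationTypeId"]

# Tag every worker who submitted the earlier HIT (IDs gathered beforehand).
previous_workers = ["A1EXAMPLE", "A2EXAMPLE"]  # hypothetical worker IDs
for worker_id in previous_workers:
    mturk.associate_qualification_with_worker(
        QualificationTypeId=qual_id,
        WorkerId=worker_id,
        IntegerValue=1,
        SendNotification=False,
    )

# When posting the new HIT, require that workers NOT hold the Qualification,
# so tagged workers cannot even preview or accept it.
exclusion = [{
    "QualificationTypeId": qual_id,
    "Comparator": "DoesNotExist",
    "ActionsGuarded": "DiscoverPreviewAndAccept",
}]
# Pass `exclusion` as QualificationRequirements to mturk.create_hit(...).
```

Reversing the comparator to "Exists" turns the same mechanism into a longitudinal recruitment tool, making the follow-up HIT visible only to prior participants.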

Keywords

Crowdsourcing · Internet research · Data quality · Longitudinal research · Mechanical Turk · MTurk

Supplementary material

ESM 1: 13428_2013_365_MOESM1_ESM.xlsx (XLSX, 42 kb)

Copyright information

© Psychonomic Society, Inc. 2013

Authors and Affiliations

  • Jesse Chandler (1)
  • Pam Mueller (2)
  • Gabriele Paolacci (3)

  1. Woodrow Wilson School of Public Affairs, Princeton University, Princeton, USA
  2. Department of Psychology, Princeton University, Princeton, USA
  3. Rotterdam School of Management, Erasmus University, Rotterdam, Netherlands
