A streamlined approach to online linguistic surveys


A growing number of linguists use large-scale experiments to test hypotheses about the data they study, in addition to more traditional informant work. In this paper we describe turktools, a new set of free, open-source tools that allow linguists to post studies online. These tools support the creation of a wide range of linguistic tasks, including grammaticality surveys, sentence completion tasks, and picture-matching tasks, making large-scale linguistic studies easy to implement. Our tools further help streamline the design of such experiments and assist in the extraction and analysis of the resulting data. Surveys created using the tools described in this paper can be posted on Amazon’s Mechanical Turk service, a popular crowdsourcing platform that mediates between ‘Requesters’ who post surveys online and ‘Workers’ who complete them. This allows many linguistic surveys to be completed within hours or days and at relatively low cost. Alternatively, researchers can host these randomized experiments on their own servers using a supplied server-side component.



  1. In AMT jargon, these tasks are called Human Intelligence Tasks, or HITs. The organization of linguistic surveys into HITs will be discussed in the Appendix (online).

  2. The process of designing an experiment can itself be very valuable. As is often the case, expanding the scope of one’s investigation can lead to interesting findings about the factors that affect the phenomenon in question. Although this goal by itself can be achieved without experimentation, we believe that the exercise of turning a theoretical research question into a testable set of experimental predictions can inform one’s thinking about the problem.

  3. Participants in university lab settings tend to be college students, and hence have a restricted distribution of age, education, and socio-economic status.

  4. An anonymous reviewer asks whether there has been a comparison of AMT and lab data for tasks involving timing, for example for Self-Paced Reading. To the best of our knowledge, although there is ongoing work attempting to answer this question (see Tily and Gibson 2015), there are no published results.

  5. A screen capture of this map can be found at http://turktools.net/crowdsourcing/.

  6. Data collected on April 24, 2013. The vast majority of experiments were on English and restricted workers’ IP addresses to within the US. Our experiments request that workers participate in each experiment only once.

  7. The only quantitative data cited by Fort et al. (2011) to motivate this concern comes from Little (2009), who reports that, over a 75-day period in their lab at MIT’s Computer Science and Artificial Intelligence Lab, 22% of their workers completed 80% of the tasks that they posted on AMT. However, these tasks were not linguistic experiments that request that workers participate only once per experiment, unlike the experiments behind the results we report above from the Experimental Syntax-Semantics Lab at MIT.

  8. For example, Cowart (1997, 2012) gives practical suggestions for systematically constructing item paradigms in Excel. Myers (2009b) presents MiniJudge, a tool designed to facilitate this process of constructing linguistic stimuli online.

  9. Our supplied skeletons support choices introduced with buttons below the sentence, as in Fig. 1, or with a drop-down menu.

  10. Some of these modifications require custom JavaScript programming in the template. Our own templates utilize the jQuery JavaScript library (http://jquery.com/), and we recommend its use for such custom programming.

  11. Turktools is an ongoing, open-source project. The documentation will be continuously updated as necessary, and we encourage contributions by other users. Details can be found at: http://turktools.net/use/.

  12. At the time of writing, these tools require the use of Python 2.6.x or 2.7.x, available at http://python.org. However, the tools described here and their prerequisites and usage are subject to change. Please consult the latest information at http://turktools.net before using these tools.

  13. In the interest of space, we do not critically review the Gibson et al. (2011) paper and turkolizer tool.

  14. The strengths and increased flexibility of WebExp and Ibex come with a higher technical barrier to entry than turktools, both in the creation of experiments and in their deployment. Both are written as server-side software packages designed to run on the researchers’ own servers, configured in a particular way. To recruit participants for WebExp/Ibex experiments on AMT, a simple template is used in AMT to redirect participants to the externally hosted survey. AMT provides a sample template, called “Survey Link,” for such purposes. An additional step of cross-referencing the AMT and WebExp/Ibex submission results is then necessary to verify experiment participation before paying participants on AMT.
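
The cross-referencing step mentioned in footnote 14 can be sketched as follows. This is a minimal illustration under stated assumptions, not part of turktools, WebExp, or Ibex: it assumes each participant receives a completion code from the externally hosted survey and pastes it into the AMT form, and the column names (`WorkerId`, `Answer.code`, `code`) and the function `cross_reference` are hypothetical stand-ins for whatever the two result files actually contain.

```python
import csv
import io


def cross_reference(amt_rows, survey_rows,
                    amt_code_field="Answer.code", survey_code_field="code"):
    """Split AMT submissions into those with a matching survey record and those without.

    All field names are hypothetical; adjust them to the actual result files.
    """
    # Codes recorded by the externally hosted survey, normalized for comparison.
    valid_codes = {row[survey_code_field].strip().lower() for row in survey_rows}

    verified, unverified = [], []
    seen = set()  # codes already claimed, so each code is honored only once
    for row in amt_rows:
        code = row[amt_code_field].strip().lower()
        if code in valid_codes and code not in seen:
            seen.add(code)
            verified.append(row["WorkerId"])
        else:
            unverified.append(row["WorkerId"])
    return verified, unverified


# Toy data standing in for the two downloaded result files.
amt_csv = "WorkerId,Answer.code\nW1,abc123\nW2,ZZZ999\nW3,abc123\nW4, def456 \n"
survey_csv = "code,responses\nabc123,42\ndef456,42\n"

amt_rows = list(csv.DictReader(io.StringIO(amt_csv)))
survey_rows = list(csv.DictReader(io.StringIO(survey_csv)))
verified, unverified = cross_reference(amt_rows, survey_rows)
print(verified)    # workers whose code matches a survey record
print(unverified)  # workers to check by hand before deciding on payment
```

In this sketch a reused or unknown code is set aside for manual review rather than rejected automatically, since a mismatch may reflect a copy-and-paste error rather than non-participation.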


  1. Bard, Ellen Gurman, Dan Robertson, and Antonella Sorace. 1996. Magnitude estimation of linguistic acceptability. Language 72: 107–150.

  2. Berinsky, Adam J., Gregory A. Huber, and Gabriel S. Lenz. 2012. Evaluating online labor markets for experimental research: Amazon.com’s Mechanical Turk. Political Analysis.

  3. Buhrmester, Michael, Tracy Kwang, and Samuel D. Gosling. 2011. Amazon’s Mechanical Turk: A new source of inexpensive, yet high-quality data? Perspectives on Psychological Science 6(1): 3–5.

  4. Cable, Seth, and Jesse Harris. 2011. On the grammatical status of PP-Pied-Piping in English: Results from sentence-rating experiments. In University of Massachusetts Occasional Papers in Linguistics: Processing Linguistic Structure, eds. Margaret Grant and Jesse Harris. Vol. 38, 1–22. Amherst: GLSA Publications.

  5. Chemla, Emmanuel, and Benjamin Spector. 2011. Experimental evidence for embedded scalar implicatures. Journal of Semantics 28(3): 359–400. doi:10.1093/jos/ffq023.

  6. Chomsky, Noam. 1965. Aspects of the theory of syntax. Cambridge: MIT Press.

  7. Cowart, Wayne. 1997. Experimental syntax: Applying objective methods to sentence judgments. Thousand Oaks: Sage Publications.

  8. Cowart, Wayne. 2012. Doing experimental syntax: bridging the gap between syntactic questions and well-designed questionnaires. In In search of grammar: Experimental and corpus-based studies, ed. James Myers, 67–96.

  9. Culicover, Peter W., and Ray Jackendoff. 2010. Quantitative methods alone are not enough: response to Gibson and Fedorenko. Trends in Cognitive Sciences 14(6): 234–235.

  10. Crump, Matthew J. C., John V. McDonnell, and Todd M. Gureckis. 2013. Evaluating Amazon’s Mechanical Turk as a tool for experimental behavioral research. PLoS ONE 8(3): e57410.

  11. Drummond, Alex. 2007. Ibex (Internet-based experiments). Software. https://code.google.com/p/webspr/.

  12. Edelman, Shimon, and Morten Christiansen. 2003. How seriously should we take minimalist syntax? Trends in Cognitive Sciences 7: 60–61.

  13. Featherston, Sam. 2005. Magnitude estimation and what it can do for your syntax: Some wh-constraints in German. Lingua 115: 1525–1550.

  14. Ferreira, Fernanda. 2005. Psycholinguistics, formal grammars, and cognitive science. The Linguistic Review 22: 365–380.

  15. Fort, Karën, Gilles Adda, and K. Bretonnel Cohen. 2011. Amazon Mechanical Turk: Gold mine or coal mine? Computational Linguistics 37(2): 413–420.

  16. Fukuda, Shin, Dan Michel, Henry Beecher, and Grant Goodall. 2010. Comparing three methods for sentence judgment experiments. Linguistic Society of America (LSA) Annual Meeting, Baltimore, MD.

  17. Gelman, Andrew, and Jennifer Hill. 2007. Data analysis using regression and multilevel/hierarchical models. Cambridge: Cambridge University Press.

  18. Germine, Laura, Ken Nakayama, Bradley C. Duchaine, Christopher F. Chabris, Garga Chatterjee, and Jeremy B. Wilmer. 2012. Is the Web as good as the lab? Comparable performance from web and lab in cognitive/perceptual experiments. Psychonomic Bulletin and Review 19(5).

  19. Gibson, Edward, and Evelina Fedorenko. 2010. Weak quantitative standards in linguistics research. Trends in Cognitive Sciences 14: 233–234.

  20. Gibson, Edward, Steve Piantadosi, and Kristina Fedorenko. 2011. Using Mechanical Turk to obtain and analyze English acceptability judgments. Language and Linguistics Compass 5(8): 509–524.

  21. Gosling, Samuel D., Simine Vazire, Sanjay Srivastava, and Oliver P. John. 2004. Should we trust web-based studies? A comparative analysis of six preconceptions about Internet questionnaires. The American Psychologist 59(2): 93–104.

  22. Horton, John J., David G. Rand, and Richard J. Zeckhauser. 2011. The online laboratory: Conducting experiments in a real labor market. Experimental Economics 14: 399–425.

  23. Huang, Yi Ting, Elizabeth Spelke, and Jesse Snedeker. 2013. What exactly do numbers mean? Language Learning and Development 9(2): 105–129.

  24. Ipeirotis, Panagiotis, Foster Provost, and Jing Wang. 2010. Quality management on Amazon Mechanical Turk. In HCOMP’10: Proceedings of the ACM SIGKDD Workshop on Human Computation 2, 64–67.

  25. Ipeirotis, Panagiotis. 2010. Analyzing the Amazon Mechanical Turk Marketplace. ACM XRDS (Crossroads) 17(2): 16–21.

  26. Just, Marcel A., Patricia A. Carpenter, and Jacqueline D. Woolley. 1982. Paradigms and processes in reading comprehension. Journal of Experimental Psychology: General 111: 228–238.

  27. Keller, Frank. 2000. Gradience in Grammar: Experimental and computational aspects of degrees of grammaticality. Ph.D. Thesis, University of Edinburgh.

  28. Keller, Frank, Martin Corley, Steffan Corley, Lars Konieczny, and Amalia Todirascu. 1998. WebExp: A Java toolbox for web-based psychological experiments (Technical Report No. HCRC/TR-99). Human Communication Research Centre, University of Edinburgh.

  29. Keller, Frank, Subahshini Gunasekharan, Neil Mayo, and Martin Corley. 2009. Timing accuracy of web experiments: A case study using the WebExp software package. Behavior Research Methods 41(1): 1–12.

  30. Kotek, Hadas, Yasutada Sudo, Edwin Howard, and Martin Hackl. 2011. Most meanings are superlative. In Syntax and semantics 37: Experiments at the interfaces, ed. Jeff Runner, 101–145.

  31. Langendoen, Terence D., Nancy Kalish-Landon, and John Dore. 1973. Dative questions: A study of the relation of acceptability to grammaticality of an English sentence type. Cognition 2: 451–478.

  32. Little, Greg. 2009. How many turkers are there? Deneme: A blog of experiments on Amazon Mechanical Turk. http://groups.csail.mit.edu/uid/deneme/?p=502. Retrieved March 28, 2014.

  33. Marantz, Alec. 2005. Generative linguistics within the cognitive neuroscience of language. The Linguistic Review 22: 429–445.

  34. Mason, Winter, and Siddharth Suri. 2012. Conducting behavioral experiments on Amazon’s Mechanical Turk. Behavior Research Methods 44: 1–23.

  35. Milsark, Gary. 1974. Existential sentences in English. Doctoral dissertation, MIT.

  36. Milsark, Gary. 1977. Toward an explanation of certain peculiarities of the existential construction in English. Linguistic Analysis 3: 1–29.

  37. Munro, Robert, Steven Bethard, Victor Kuperman, Vicky Tzuyin Lai, Robin Melnick, Christopher Potts, Tyler Schnoebelen, and Harry Tily. 2010. Crowdsourcing and language studies: the new generation of linguistic data. In Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, Los Angeles, CA.

  38. Myers, James. 2009a. Syntactic judgment experiments. Language and Linguistics Compass 3: 406–423.

  39. Myers, James. 2009b. The design and analysis of small-scale syntactic judgment experiments. Lingua 119: 425–444.

  40. Paolacci, Gabriele, Jesse Chandler, and Panagiotis Ipeirotis. 2010. Running experiments on Amazon Mechanical Turk. Judgment and Decision Making 5(5): 411–419.

  41. Pearson, Hazel, Manizeh Khan, and Jesse Snedeker. 2010. Even more evidence for the emptiness of plurality: An experimental investigation of plural interpretation as a species of implicature. Semantics and Linguistic Theory (SALT) 20: 489–508.

  42. Phillips, Colin. 2010. Should we impeach armchair linguists? In Japanese-Korean Linguistics (JK17), eds. Soichi Iwasaki, Hajime Hoji, Patricia Clancy, Devyani Sharma, and Sung-Och Sohn. Vol. 17, 49–64. Stanford: CSLI Publications.

  43. Phillips, Colin, and Howard Lasnik. 2003. Linguistics and empirical evidence: Reply to Edelman and Christiansen. Trends in Cognitive Sciences 7: 61–62.

  44. Reips, Ulf-Dietrich. 2002. Standards for Internet-based experimenting. Experimental Psychology 49(4): 243–256.

  45. Schütze, Carson. 1996. The empirical base of linguistics: grammaticality judgments and linguistic methodology. Chicago: University of Chicago Press.

  46. Schütze, Carson, and Jon Sprouse. 2013. Judgment data. In Research Methods in Linguistics, eds. Robert J. Podesva and Devyani Sharma, 27–50. Cambridge: Cambridge University Press.

  47. Shapiro, Danielle, Jesse Chandler, and Pam Mueller. 2013. Using Mechanical Turk to study clinical populations. Clinical Psychological Science 1(2): 213–220.

  48. Sprouse, Jon. 2009. Revisiting satiation: Evidence for an equalization response strategy. Linguistic Inquiry 40: 329–341.

  49. Sprouse, Jon. 2011. A validation of Amazon Mechanical Turk for the collection of acceptability judgments in linguistic theory. Behavior Research Methods 43: 155–167.

  50. Sprouse, Jon. 2013. Acceptability judgments. In Oxford Bibliographies Online: Linguistics, ed. Mark Aronoff.

  51. Sprouse, Jon, and Diogo Almeida. 2012. Assessing the reliability of textbook data in syntax: Adger’s Core Syntax. Journal of Linguistics 48: 609–652.

  52. Sprouse, Jon, and Diogo Almeida. 2013. The empirical status of data in syntax: a reply to Gibson and Fedorenko. Language and Cognitive Processes 28(3): 222–228.

  53. Sprouse, Jon, Carson Schütze, and Diogo Almeida. 2013. A comparison of informal and formal acceptability judgments using a random sample from Linguistic Inquiry 2001–2010. Lingua 134: 219–248.

  54. Tamir, D. 2011. 50,000 Worldwide Mechanical Turk workers. Techlist, http://techlist.com/mturk/global-mturk-worker-map.php.

  55. Tily, Harry, and Edward Gibson. 2015, in preparation. Self-paced reading on Mechanical Turk.

  56. Wasow, Thomas, and Jennifer Arnold. 2005. Intuitions in linguistic argumentation. Lingua 115: 1481–1496.

  57. Weskott, Thomas, and Gisbert Fanselow. 2011. On the informativity of different measures of linguistic acceptability. Language 87: 249–273.


For helpful comments and discussion of this paper and the associated tools, we would like to thank Martin Hackl, David Pesetsky, Coppe van Urk, and participants of our 2013 workshop at MIT on designing linguistic experiments for Mechanical Turk. The current paper has also greatly benefited from the feedback of four anonymous NLLT reviewers, as well as the editor Marcel den Dikken. Any and all errors are ours.

Correspondence to Michael Yoshitaka Erlewine.

Erlewine, M.Y., Kotek, H. A streamlined approach to online linguistic surveys. Nat Lang Linguist Theory 34, 481–495 (2016). https://doi.org/10.1007/s11049-015-9305-9


  • Experimental methods
  • Online surveys
  • Web-based experiments
  • Crowdsourcing
  • Amazon Mechanical Turk
  • Software