More and more researchers in linguistics use large-scale experiments to test hypotheses about the data they research, in addition to more traditional informant work. In this paper we describe turktools, a new set of free, open-source tools that allow linguists to post studies online. These tools support the creation of a wide range of linguistic tasks, including grammaticality surveys, sentence completion tasks, and picture-matching tasks, making large-scale linguistic studies easy to implement. Our tools further help streamline the design of such experiments and assist in the extraction and analysis of the resulting data. Surveys created using the tools described in this paper can be posted on Amazon’s Mechanical Turk (AMT) service, a popular crowdsourcing platform that mediates between ‘Requesters’ who post surveys online and ‘Workers’ who complete them. This allows many linguistic surveys to be completed within hours or days and at relatively low cost. Alternatively, researchers can host these randomized experiments on their own servers using a supplied server-side component.
In AMT jargon, these tasks are called Human Intelligence Tasks, or HITs. The organization of linguistic surveys into HITs will be discussed in the Appendix (online).
The process of designing an experiment can itself be very valuable. As is often the case, expanding the scope of one’s investigation can lead to interesting findings about the factors that affect the phenomenon in question. Although this goal by itself can be achieved without experimentation, we believe that the exercise of turning a theoretical research question into a testable set of experimental predictions can inform one’s thinking about the problem.
Participants in university lab settings tend to be college students, and hence have a restricted distribution of age, education, and socio-economic status.
An anonymous reviewer asks whether there has been a comparison of AMT and lab data for tasks involving timing, for example for Self-Paced Reading. To the best of our knowledge, although there is ongoing work attempting to answer this question (see Tily and Gibson 2015), there are no published results.
A screen capture of this map can be found at http://turktools.net/crowdsourcing/.
Data collected on April 24, 2013. The vast majority of experiments were on English and restricted IP addresses of workers to within the US. Our experiments request that workers participate in each experiment only once.
The only quantitative data cited by Fort et al. (2011) to motivate this concern comes from Little (2009), who reports that, over a 75-day period in their lab at MIT’s Computer Science and Artificial Intelligence Lab, 22% of their workers completed 80% of the tasks that they posted on AMT. However, these tasks were not linguistic experiments that request that workers participate only once per experiment, unlike the experiments behind the results we report above from the Experimental Syntax-Semantics Lab at MIT.
Our supplied skeletons support choices introduced with buttons below the sentence, as in Fig. 1, or with a drop-down menu.
Turktools is an ongoing, open-source project. The documentation will be continuously updated as necessary, and we encourage contributions by other users. Details can be found at: http://turktools.net/use/.
In the interest of space, we do not critically review the Gibson et al. (2011) paper and turkolizer tool.
The strengths and increased flexibility of WebExp and Ibex come with a higher technical barrier to entry than turktools, both in experiment creation and in experiment deployment. Both are written as server-side software packages that are designed to run on the researchers’ own servers, configured in a particular way. To recruit participants for WebExp/Ibex experiments on AMT, a simple template is used in AMT to redirect participants to the externally-hosted survey. AMT provides a sample template, called “Survey Link,” for such purposes. An additional step is then necessary: submissions must be cross-referenced between the AMT and WebExp/Ibex results to verify experiment participation before participants are paid on AMT.
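The cross-referencing step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not part of turktools, WebExp, or Ibex. It assumes that each participant receives a completion code at the end of the externally hosted survey and enters it in the AMT form, that the AMT results file has `AssignmentId` and `Answer.surveycode` columns, and that the survey server's results have a `code` column. The column names, the function name `cross_reference`, and the sample data are all hypothetical.

```python
import csv
import io


def cross_reference(amt_rows, survey_rows,
                    amt_code_field="Answer.surveycode",  # assumed column name
                    survey_code_field="code"):           # assumed column name
    """Split AMT assignments into those whose completion code matches a
    survey submission and those that need manual review.

    A code counts only once: a second assignment reusing an already
    matched code is sent to review rather than approved.
    """
    valid_codes = {
        row[survey_code_field].strip()
        for row in survey_rows
        if row[survey_code_field].strip()
    }
    approve, needs_review = [], []
    seen = set()
    for row in amt_rows:
        code = row[amt_code_field].strip()
        if code in valid_codes and code not in seen:
            seen.add(code)
            approve.append(row["AssignmentId"])
        else:
            needs_review.append(row["AssignmentId"])
    return approve, needs_review


# Hypothetical sample data standing in for the two downloaded results files.
AMT_CSV = """AssignmentId,WorkerId,Answer.surveycode
A1,W1,x7k2
A2,W2,zzzz
A3,W3,q9p4
A4,W4,x7k2
"""

SURVEY_CSV = """code,participant
x7k2,1
q9p4,2
m3n8,3
"""

amt_rows = list(csv.DictReader(io.StringIO(AMT_CSV)))
survey_rows = list(csv.DictReader(io.StringIO(SURVEY_CSV)))

approve, needs_review = cross_reference(amt_rows, survey_rows)
print("approve:", approve)
print("needs review:", needs_review)
```

In this sketch, assignments with an unknown or reused code are flagged for manual review rather than rejected automatically, since a mistyped code does not by itself show that the participant skipped the survey.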
Bard, Ellen Gurman, Dan Robertson, and Antonella Sorace. 1996. Magnitude estimation of linguistic acceptability. Language 72: 107–150.
Berinsky, Adam J., Gregory A. Huber, and Gabriel S. Lenz. 2012. Evaluating online labor markets for experimental research: Amazon.com’s Mechanical Turk. Political Analysis.
Buhrmester, Michael, Tracy Kwang, and Samuel D. Gosling. 2011. Amazon’s Mechanical Turk: A new source of inexpensive, yet high-quality data? Perspectives on Psychological Science 6(1): 3–5.
Cable, Seth, and Jesse Harris. 2011. On the grammatical status of PP-Pied-Piping in English: Results from sentence-rating experiments. In University of Massachusetts Occasional Papers in Linguistics: Processing Linguistic Structure, eds. Margaret Grant and Jesse Harris. Vol. 38, 1–22. Amherst: GLSA Publications.
Chemla, Emmanuel, and Benjamin Spector. 2011. Experimental evidence for embedded scalar implicatures. Journal of Semantics 28(3): 359–400. doi:10.1093/jos/ffq023.
Chomsky, Noam. 1965. Aspects of the theory of syntax. Cambridge: MIT Press.
Cowart, Wayne. 1997. Experimental syntax: Applying objective methods to sentence judgments. Thousand Oaks: Sage Publications.
Cowart, Wayne. 2012. Doing experimental syntax: bridging the gap between syntactic questions and well-designed questionnaires. In In search of grammar: Experimental and corpus-based studies, ed. James Myers, 67–96.
Culicover, Peter W., and Ray Jackendoff. 2010. Quantitative methods alone are not enough: response to Gibson and Fedorenko. Trends in Cognitive Sciences 14(6): 234–235.
Crump, Matthew J. C., John V. McDonnell, and Todd M. Gureckis. 2013. Evaluating Amazon’s Mechanical Turk as a tool for experimental behavioral research. PLoS ONE 8(3): e57410.
Drummond, Alex. 2007. Ibex (Internet-based experiments). Software. https://code.google.com/p/webspr/.
Edelman, Shimon, and Morten Christiansen. 2003. How seriously should we take minimalist syntax? Trends in Cognitive Sciences 7: 60–61.
Featherston, Sam. 2005. Magnitude estimation and what it can do for your syntax: Some wh-constraints in German. Lingua 115: 1525–1550.
Ferreira, Fernanda. 2005. Psycholinguistics, formal grammars, and cognitive science. The Linguistic Review 22: 365–380.
Fort, Karën, Gilles Adda, and K. Bretonnel Cohen. 2011. Amazon Mechanical Turk: Gold mine or coal mine? Computational Linguistics 37(2): 413–420.
Fukuda, Shin, Dan Michel, Henry Beecher, and Grant Goodall. 2010. Comparing three methods for sentence judgment experiments. Linguistic Society of America (LSA) Annual Meeting, Baltimore, MD.
Gelman, Andrew, and Jennifer Hill. 2007. Data analysis using regression and multilevel/hierarchical models. Cambridge: Cambridge University Press.
Germine, Laura, Ken Nakayama, Bradley C. Duchaine, Christopher F. Chabris, Garga Chatterjee, and Jeremy B. Wilmer. 2012. Is the Web as good as the lab? Comparable performance from web and lab in cognitive/perceptual experiments. Psychonomic Bulletin and Review 19(5).
Gibson, Edward, and Evelina Fedorenko. 2010. Weak quantitative standards in linguistics research. Trends in Cognitive Sciences 14: 233–234.
Gibson, Edward, Steve Piantadosi, and Kristina Fedorenko. 2011. Using Mechanical Turk to obtain and analyze English acceptability judgments. Language and Linguistics Compass 5(8): 509–524.
Gosling, Samuel D., Simine Vazire, Sanjay Srivastava, and Oliver P. John. 2004. Should we trust web-based studies? A comparative analysis of six preconceptions about Internet questionnaires. The American Psychologist 59(2): 93–104.
Horton, John J., David G. Rand, and Richard J. Zeckhauser. 2011. The online laboratory: Conducting experiments in a real labor market. Experimental Economics 14: 399–425.
Huang, Yi Ting, Elizabeth Spelke, and Jesse Snedeker. 2013. What exactly do numbers mean? Language Learning and Development 9(2): 105–129.
Ipeirotis, Panagiotis, Foster Provost, and Jing Wang. 2010. Quality management on Amazon Mechanical Turk. In HCOMP’10: Proceedings of the ACM SIGKDD Workshop on Human Computation 2, 64–67.
Ipeirotis, Panagiotis. 2010. Analyzing the Amazon Mechanical Turk Marketplace. ACM XRDS (Crossroads) 17(2): 16–21.
Just, Marcel A., Patricia A. Carpenter, and Jacqueline D. Woolley. 1982. Paradigms and processes in reading comprehension. Journal of Experimental Psychology: General 111: 228–238.
Keller, Frank. 2000. Gradience in Grammar: Experimental and computational aspects of degrees of grammaticality. Ph.D. Thesis, University of Edinburgh.
Keller, Frank, Martin Corley, Steffan Corley, Lars Konieczny, and Amalia Todirascu. 1998. WebExp: A Java toolbox for web-based psychological experiments (Technical Report No. HCRC/TR-99). Human Communication Research Centre, University of Edinburgh.
Keller, Frank, Subahshini Gunasekharan, Neil Mayo, and Martin Corley. 2009. Timing accuracy of web experiments: A case study using the WebExp software package. Behavior Research Methods 41(1): 1–12.
Kotek, Hadas, Yasutada Sudo, Edwin Howard, and Martin Hackl. 2011. Most meanings are superlative. In Syntax and semantics 37: Experiments at the interfaces, ed. Jeff Runner, 101–145.
Langendoen, Terence D., Nancy Kalish-Landon, and John Dore. 1973. Dative questions: A study of the relation of acceptability to grammaticality of an English sentence type. Cognition 2: 451–478.
Little, Greg. 2009. How many turkers are there? Deneme: A blog of experiments on Amazon Mechanical Turk. http://groups.csail.mit.edu/uid/deneme/?p=502. Retrieved March 28, 2014.
Marantz, Alec. 2005. Generative linguistics within the cognitive neuroscience of language. The Linguistic Review 22: 429–445.
Mason, Winter, and Siddharth Suri. 2012. Conducting behavioral experiments on Amazon’s Mechanical Turk. Behavior Research Methods 44: 1–23.
Milsark, Gary. 1974. Existential sentences in English. Doctoral dissertation, MIT.
Milsark, Gary. 1977. Toward an explanation of certain peculiarities of the existential construction in English. Linguistic Analysis 3: 1–29.
Munro, Robert, Steven Bethard, Victor Kuperman, Vicky Tzuyin Lai, Robin Melnick, Christopher Potts, Tyler Schnoebelen, and Harry Tily. 2010. Crowdsourcing and language studies: the new generation of linguistic data. In Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, Los Angeles, CA.
Myers, James. 2009a. Syntactic judgment experiments. Language and Linguistics Compass 3: 406–423.
Myers, James. 2009b. The design and analysis of small-scale syntactic judgment experiments. Lingua 119: 425–444.
Paolacci, Gabriele, Jesse Chandler, and Panagiotis Ipeirotis. 2010. Running experiments on Amazon Mechanical Turk. Judgment and Decision Making 5(5): 411–419.
Pearson, Hazel, Manizeh Khan, and Jesse Snedeker. 2010. Even more evidence for the emptiness of plurality: An experimental investigation of plural interpretation as a species of implicature. Semantic and Linguistic Theory (SALT) 20: 489–508.
Phillips, Colin. 2010. Should we impeach armchair linguists? In Japanese-Korean Linguistics (JK17), eds. Soichi Iwasaki, Hajime Hoji, Patricia Clancy, Devyani Sharma, and Sung-Och Sohn. Vol. 17, 49–64. Stanford: CSLI Publications.
Phillips, Colin, and Howard Lasnik. 2003. Linguistics and empirical evidence: Reply to Edelman and Christiansen. Trends in Cognitive Sciences 7: 61–62.
Reips, Ulf-Dietrich. 2002. Standards for Internet-based experimenting. Experimental Psychology 49(4): 243–256.
Schütze, Carson. 1996. The empirical base of linguistics: grammaticality judgments and linguistic methodology. Chicago: University of Chicago Press.
Schütze, Carson, and Jon Sprouse. 2013. Judgment data. In Research Methods in Linguistics, eds. Robert J. Podesva and Devyani Sharma, 27–50. Cambridge: Cambridge University Press.
Shapiro, Danielle, Jesse Chandler, and Pam Mueller. 2013. Using Mechanical Turk to study clinical populations. Clinical Psychological Science 1(2): 213–220.
Sprouse, Jon. 2009. Revisiting satiation: Evidence for an equalization response strategy. Linguistic Inquiry 40: 329–341.
Sprouse, Jon. 2011. A validation of Amazon Mechanical Turk for the collection of acceptability judgments in linguistic theory. Behavior Research Methods 43: 155–167.
Sprouse, Jon. 2013. Acceptability judgments. In Oxford Bibliographies Online: Linguistics, ed. Mark Aronoff.
Sprouse, Jon, and Diogo Almeida. 2012. Assessing the reliability of textbook data in syntax: Adger’s Core Syntax. Journal of Linguistics 48: 609–652.
Sprouse, Jon, and Diogo Almeida. 2013. The empirical status of data in syntax: a reply to Gibson and Fedorenko. Language and Cognitive Processes 28(3): 222–228.
Sprouse, Jon, Carson Schütze, and Diogo Almeida. 2013. A comparison of informal and formal acceptability judgments using a random sample from Linguistic Inquiry 2001–2010. Lingua 134: 219–248.
Tamir, D. 2011. 50,000 Worldwide Mechanical Turk workers. Techlist, http://techlist.com/mturk/global-mturk-worker-map.php.
Tily, Harry, and Edward Gibson. 2015, in preparation. Self-paced reading on Mechanical Turk.
Wasow, Thomas, and Jennifer Arnold. 2005. Intuitions in linguistic argumentation. Lingua 115: 1481–1496.
Weskott, Thomas, and Gisbert Fanselow. 2011. On the informativity of different measures of linguistic acceptability. Language 87: 249–273.
For helpful comments and discussion of this paper and the associated tools, we would like to thank Martin Hackl, David Pesetsky, Coppe van Urk, and participants of our 2013 workshop at MIT on designing linguistic experiments for Mechanical Turk. The current paper has also greatly benefited from the feedback of four anonymous NLLT reviewers, as well as the editor Marcel den Dikken. Any and all errors are ours.
Cite this article
Erlewine, M.Y., Kotek, H. A streamlined approach to online linguistic surveys. Nat Lang Linguist Theory 34, 481–495 (2016). https://doi.org/10.1007/s11049-015-9305-9