A validation of Amazon Mechanical Turk for the collection of acceptability judgments in linguistic theory
Amazon’s Mechanical Turk (AMT) is a Web application that provides instant access to thousands of potential participants for survey-based psychology experiments, such as the acceptability judgment task used extensively in syntactic theory. Because AMT is a Web-based system, syntacticians may worry that the move out of the experimenter-controlled environment of the laboratory and onto the user-controlled environment of AMT could adversely affect the quality of the judgment data collected. This article reports a quantitative comparison of two identical acceptability judgment experiments, each with 176 participants (352 total): one conducted in the laboratory, and one conducted on AMT. Crucial indicators of data quality—such as participant rejection rates, statistical power, and the shape of the distributions of the judgments for each sentence type—are compared between the two samples. The results suggest that aside from slightly higher participant rejection rates, AMT data are almost indistinguishable from laboratory data.
Keywords: Amazon Mechanical Turk, Acceptability judgments, Grammaticality judgments, Experimental syntax, Linguistic theory
From a purely methodological point of view, syntacticians are interested in identifying the properties of syntactic representations. Over the past 50 years, the dominant method for identifying the properties of syntactic representations has involved comparing two (or more) minimally different representations using a behavioral response known as an acceptability judgment as a proxy for grammatical well-formedness (Chomsky, 1965; Schütze, 1996). Traditionally, these acceptability judgments have been collected using an informal experiment consisting of only a handful of participants (usually the researcher’s colleagues) and a handful of experimental items (Marantz, 2005). This informal methodology has worked well because acceptability judgments of linguistic phenomena tend to be strikingly robust, even at very small sample sizes (for a large-scale quantitative evaluation, see Sprouse & Almeida, 2010). The success of informal experiments notwithstanding, over the past 15 years, a number of syntacticians have argued that formal experimental methods—such as full-scale surveys, large samples, and sophisticated scaling tasks like magnitude estimation—can provide an additional level of detail (usually in the form of statistical models) that can help clarify some theoretical questions in syntactic theory (e.g., Bard, Robertson, & Sorace, 1996; Cowart, 1997; Featherston, 2005a, 2005b; Keller, 2000; Myers, 2009; Sorace & Keller, 2004; Sprouse, 2009; Sprouse & Cunningham, submitted for publication; Sprouse, Wagers, & Phillips, 2010). Of course, the additional information gained by formal acceptability experiments is offset by the fact that they take considerably more time to deploy than informal acceptability experiments: an informal experiment can be conducted in a matter of minutes, whereas formal experiments can require several weeks for recruiting and running a full sample (e.g., 25–30 participants).
Several free software solutions, such as WebExp (Keller, Gunasekharan, Mayo, & Corley, 2009) and MiniJudge (Myers, 2009), have been developed to allow acceptability judgments to be collected over the Web, and thus reduce some of the collection time. Though successful at reducing physical data collection time, these software solutions still require the experimenter to invest time in participant recruitment (and compensation disbursement), which can still take weeks to complete. It has been recently suggested that syntacticians could use the Amazon Mechanical Turk marketplace (henceforth, AMT) to completely automate the recruitment of participants, the administration of surveys, and the disbursement of compensation, thus virtually eliminating the time cost of formal experiments (see, e.g., Gibson & Fedorenko, in press). AMT is an online marketplace where companies or individuals (called requesters) can post small tasks (called Human Intelligence Tasks, or HITs) that cannot easily be automated, and therefore require human workers (called workers) for completion. These HITs are generally very small in nature (such as identifying the contents of an image), and generally very high in quantity (it is not unusual for requesters to post thousands of tasks in a single batch). Requesters generally pay very little per HIT (e.g., $0.02 U.S.) and retain the ability to accept or reject the results of each HIT before Amazon sends payment to the worker. In this way, requesters are able to crowdsource (cf. outsource) tasks that would previously have required hours of work by in-house employees at considerably more expensive compensation rates. HITs can be posted using an online interface (www.mturk.com), and results can be downloaded in CSV format. From the point of view of an experimenter, AMT provides instantaneous access to thousands of potential participants and provides the tools necessary to distribute surveys, collect responses, and disburse payments.
It should be noted that AMT has already proven useful in at least one area of language research, computational linguistics, where it has been used for corpus annotation and evaluation—two tasks that have historically consumed significant time and resources (see, e.g., the recent NAACL HLT 2010 Workshop on Creating Speech and Language Data With Amazon’s Mechanical Turk; proceedings available online at www.aclweb.org/anthology/W/W10/W10-07.pdf). However, AMT has yet to be widely adopted by syntacticians who run formal acceptability experiments. The primary concern among syntacticians is that moving formal acceptability judgments out of the experimenter-controlled environment of the laboratory and onto the user-controlled environment of AMT may adversely affect the quality of the data collected and potentially negate the quantitative advantages that motivate formal experiments in the first place. In the laboratory, the experimenter can ensure that all participants are part of the population of interest (e.g., native speakers of U.S. English), control the environmental distractions, influence the rate of completion (“don’t rush”), verify that participants understand the task, and answer any questions that may arise. Before syntacticians can widely adopt AMT, they will need to be reasonably sure that the loss of this control will not affect the quality of the data that are collected. To that end, the goal of this article is to compare the results of a large-scale laboratory-based experiment (176 participants) and an identical AMT-based experiment (176 participants) along all of the quantitative measures of interest to linguists: time, cost (in money), participant rejection rate, detection rates of several known effects (both strong and weak) at a range of sample sizes, and differences in the shapes of the distributions of ratings for each condition (peak, dispersion, etc.).
Quantitative validation studies such as this require two large data sets: a reference data set and a target (AMT) data set. Given the relative scarcity of funding in linguistics, it seems unlikely that syntacticians will devote their limited resources to collecting two large data sets simply to validate AMT. However, Sprouse, Wagers, and Phillips (2010) collected a large data set as part of a theoretically motivated study: 176 participants, 24 different sentence types, 16 different lexicalizations (tokens) of each sentence type, and four judgments per sentence type per participant. This data set serves as the reference data for the AMT validation. The details of the experiment are given in the rest of this section.
A group of 176 (152 female) self-reported monolingual native speakers of English, all University of California Irvine undergraduates, participated in the laboratory experiment for either course credit or $5. Another 176 (102 female) unique AMT workers participated in the AMT experiment for $3.
A total of 24 sentence types (conditions) were tested in this experiment. Sixteen lexicalizations of each sentence type were created and distributed among four lists using a Latin-square procedure. This meant that each list consisted of four tokens per sentence type, for a total of 96 items per list. Two orders for each of the four lists were created by pseudorandomizing the items such that related sentence types were never presented successively. This resulted in eight different surveys.
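The list construction described above can be sketched in Python. This is only a minimal sketch under stated assumptions, not the procedure actually used in the study: the rotation rule `(condition + lexicalization) % 4`, the grouping of conditions into six hypothetical families of four "related" sentence types, and all function names are illustrative.

```python
import random

N_CONDITIONS = 24        # sentence types
N_LEXICALIZATIONS = 16   # tokens per sentence type
N_LISTS = 4

def latin_square_lists():
    """Assign each (condition, lexicalization) token to one of four lists.

    Assumed rotation rule: token (c, l) goes to list (c + l) % 4, so every
    list receives 4 tokens of every condition (96 items per list).
    """
    lists = [[] for _ in range(N_LISTS)]
    for cond in range(N_CONDITIONS):
        for lex in range(N_LEXICALIZATIONS):
            lists[(cond + lex) % N_LISTS].append((cond, lex))
    return lists

def family_of(item):
    """Hypothetical grouping: conditions 0-3 are related, 4-7 are related, etc."""
    cond, _lex = item
    return cond // 4

def pseudorandomize(items, rng, max_tries=1000):
    """Order items so that no two adjacent items share a family.

    Greedy construction with restarts: pick a random eligible item at each
    position; if no eligible item is left, start over.
    """
    for _ in range(max_tries):
        remaining = list(items)
        order = []
        while remaining:
            prev = family_of(order[-1]) if order else None
            eligible = [i for i, it in enumerate(remaining)
                        if family_of(it) != prev]
            if not eligible:
                break  # dead end, restart from scratch
            order.append(remaining.pop(rng.choice(eligible)))
        else:
            return order
    raise RuntimeError("no valid order found")

rng = random.Random(0)
lists = latin_square_lists()
# Two orders per list gives eight surveys.
surveys = [pseudorandomize(lst, rng) for lst in lists for _ in range(2)]
```

The nine fixed practice items would be prepended to each survey afterwards, since they do not vary between participants.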
An Example of Magnitude Estimation of Acceptability
Who said my brother was kept tabs on by the FBI? 100
What did Lisa meet the man that bought? ____
The standard and modulus do not change throughout the experiment. Participants are instructed that they can use any positive number that they feel is appropriate. The standard was identical for all eight surveys and was in the middle range of acceptability: Who said my brother was kept tabs on by the FBI?
Presentation in the laboratory
The experiment began with a practice phase during which participants estimated the lengths of seven lines using another line as a standard set to a modulus of 100. This practice phase ensured that participants understood the concept of magnitude estimation. During the main phase of the experiment, 10 items were presented per page (except for the final page), with the standard appearing at the top of every page inside a textbox with black borders. The first 9 items of the survey were practice items (3 each of low, medium, and high acceptability). These practice items were not marked as such—that is, the participants did not know they were practice items—and they did not vary between participants in order or lexicalization. Including the practice items, each survey was 105 items long. The task directions are available on the author’s Web site (www.ling.cogsci.uci.edu/~jsprouse/tools/amt/). Participants were under no time constraints during their visit.
Presentation on AMT
Preprocessing of responses
The responses to the nine practice items were removed, and the remaining responses for each participant were z-score transformed prior to analysis. The z-score transformation is a standardization procedure that corrects for some kinds of scale bias between participants by converting a participant’s scores into units that convey the number of standard deviations each score is from that participant’s mean score.
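As an illustration of the transformation, the following Python sketch standardizes one participant's ratings. The function name is hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not specify which estimator was used.

```python
from statistics import mean, stdev

def z_transform(ratings):
    """Convert one participant's raw ratings to z-scores.

    Each score becomes its distance from the participant's own mean,
    in units of that participant's standard deviation.
    """
    m = mean(ratings)
    s = stdev(ratings)  # sample SD (n - 1); an assumption, see lead-in
    if s == 0:
        raise ValueError("participant gave the same rating to every item")
    return [(r - m) / s for r in ratings]

# Two participants who use very different numeric ranges but make the
# same relative distinctions end up with identical z-scores.
wide = z_transform([50, 100, 150])
narrow = z_transform([5, 10, 15])
```

Here both `wide` and `narrow` come out as `[-1.0, 0.0, 1.0]`, which is the sense in which the transformation removes scale bias between participants.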
Case studies for analysis
- (2) Whether Island Effect
What do you think that John bought?
*What do you wonder whether John bought?
- (3) Complex Noun Phrase Island Effect
What did you claim that John bought?
*What did you make the claim that John bought?
- (4) Subject Island Effect
What do you think interrupted the TV show?
*What do you think the speech about interrupted the TV show?
- (5) Adjunct Island Effect
What do you think that John forgot at the office?
*What do you worry if John forgets at the office?
The next three case studies are contrasts that have historically proven particularly difficult to replicate in acceptability judgment tasks, but are nonetheless detectable with very large sample sizes like those in this study (Sprouse & Almeida, 2010). They are the center embedding illusion (e.g., Frazier, 1985; Gibson & Thomas, 1999), the comparative illusion (e.g., Phillips, Wagers, & Lau, in press), and the agreement attraction illusion (e.g., Wagers, Lau, & Phillips, 2009). These contrasts are likely difficult to detect with acceptability judgments because they are not caused by a static property of the syntactic representations, but rather by the way the sentences are processed. Such processing-based effects are generally investigated using measures with high temporal resolution, such as reaction times or event-related potentials, rather than untimed acceptability judgments; however, these three contrasts have been reported using untimed acceptability judgments, and therefore provide an interesting case study in the detection of extremely weak effects using an AMT sample.
- (6) Center Embedding Illusion
*The ancient manuscript that the grad student who the new card catalog had confused a great deal was studying in the library was missing a page.
?The ancient manuscript that the grad student who the new card catalog had confused a great deal was missing a page.
- (7) Comparative Illusion
*More people have graduated law school than I have.
?More people have been to Russia than I have.
- (8) Agreement Attraction Illusion
*The slogan on the poster unsurprisingly were designed to get attention.
?The slogan on the posters unsurprisingly were designed to get attention.
Time, cost, and participant rejection
There are many aspects of the experimental procedure that could be affected by the change of venue from the laboratory to AMT, such as the time it takes to create and run the experiment, the methods available for ensuring an appropriate sample (e.g., only native speakers of English), and the number of participants that must be removed from the sample prior to analysis. This section provides an in-depth comparison of these preanalysis aspects of the experimental procedure.
Laboratory experiments require the use of experimental software (e.g., WebExp, MiniJudge) or the creation of paper surveys; AMT experiments require the creation of an HTML survey. It took about 3 h to explore the AMT documentation (tutorials and discussion threads), and another hour to create the HTML template for the surveys, for a total of 4 h of initial setup time, which seems comparable to the initial setup of other software options. This is a one-time investment, and the HTML template is reusable; therefore, additional experiments will take only a matter of minutes to publish. The HTML template used here can be downloaded for free from the author’s Web site (www.ling.cogsci.uci.edu/~jsprouse/tools/amt/).
The primary advantage of AMT is in data collection. The laboratory-based sample took approximately 88 experimenter hours spread over a 3-month period, whereas AMT returned 170 surveys in 2 h. That is a rate of 85 participants per hour. Because a few of the participants were excluded during data collection (see the Participant Rejection section below), the total time to collect 176 correctly completed surveys was 4 h. These rates suggest that a standard-sized sample (25–35 participants) could be collected in less than 1 h using AMT.
The laboratory-based participants were paid $5 or given course credit for a 30-min visit to the laboratory. The AMT participants were paid $3 per survey. The $3 compensation rate was chosen on the basis of the other HITs available on AMT: HITs generally pay $0.02 per single task, and these surveys required 105 judgments in addition to the reading of detailed instructions. AMT charges a 10% fee in addition to the compensation given to workers, so the total participant compensation cost was $3.30 per participant ($580.80 for 176 participants). The participant compensation cost of AMT is likely to be a concern for linguists without funding. Whereas laboratory-based experiments can be run at no cost through the use of university participant pools that grant course credit, the AMT system is cash only. At these rates, a standard 30 participant/100 item experiment on AMT would cost approximately $100.
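The cost figures above follow from simple arithmetic, sketched here in Python. The function name and its defaults are illustrative; the $3 payment and the 10% fee are the rates reported in this article and may not reflect current AMT pricing.

```python
def amt_cost(n_participants, pay_per_survey=3.00, fee_rate=0.10):
    """Total cost in dollars: worker payment plus the AMT fee."""
    return round(n_participants * pay_per_survey * (1 + fee_rate), 2)

full_sample = amt_cost(176)    # the 176-participant sample in this study
standard_sample = amt_cost(30) # a standard 30-participant experiment
```

With these defaults, `full_sample` is 580.8 and `standard_sample` is 99.0, matching the $580.80 and approximately $100 figures in the text.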
Participants must be native speakers of the language of interest (e.g., U.S. English).
Participants must take the experiment only once.
The AMT documentation indicates that requesters can require that workers complete a qualification exam prior to completing HITs. These qualification exams are intended to assess the worker’s skill at a particular task. It is theoretically possible to create a qualification exam that will screen out nonnative speakers and participants who have already completed a related survey. However, workers can retake qualification exams. This means that a worker who is disqualified for being a nonnative speaker can potentially retake the exam and change his or her answers to avoid disqualification. This situation is not ideal, as it potentially encourages misrepresentation. Furthermore, several discussion threads on the AMT forum suggest that qualification exams severely decrease participation rates, as many AMT workers routinely ignore HITs that require qualification.
You lived in the United States from birth until age 13.
Both of your parents spoke English to you during those years.
Participants were paid $3 regardless of their answers to these criteria. This ensured that there was no incentive to answer untruthfully and that the responses could be used to reject participants prior to analysis. Only 3 participants answered NO to one or more of the native speaker criteria. These 3 participants were still compensated for their time, so $9.90 was lost to self-identified nonnative speakers.
To ensure that participants only completed one of the eight surveys that were part of this experiment, a paragraph was placed at the end of the survey (after all of the judgments) that instructed workers not to take any of the seven other HITs available as part of this HIT batch. They were told that they would only be paid for the first survey that they completed, so there was no monetary incentive to complete additional HITs in this batch. Because AMT assigns each worker a unique alphanumeric ID number, it is relatively straightforward to search the results for workers who have completed multiple surveys and to reject their later surveys using the AMT approval/rejection feature. If a worker is rejected through the approval/rejection feature, he or she is not compensated for that HIT, and that HIT is automatically returned to the list of available HITs to be completed by a different worker. The approval/rejection feature thus ensures that there is no monetary incentive for workers to take more than one survey in a single experiment. One participant submitted three surveys. Only the first was approved; the other two were rejected and returned to the AMT system for completion by other participants.
Because laboratory experiments are conducted in person, there are generally no false submissions. There can be participants who fail to show for a scheduled appointment, but at many universities there are penalties to dissuade no-shows. On the AMT system, there are no such penalties. Seven participants submitted incomplete surveys. These participants were rejected using the AMT rejection/approval system, which means that they were not compensated for their time, and their surveys were automatically returned to the AMT system to be taken by other participants. Together with the two repeated surveys mentioned in the previous subsection, this means that 9 out of 176 surveys were rejected using the AMT rejection/approval system and returned to the AMT system (5.1%). Identifying these 9 surveys took less than 10 min of experimenter time and resulted in no monetary loss.
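The search for repeated and incomplete submissions can be automated on the downloaded results file. The Python sketch below is one possible approach, not the procedure used here; the column names (`WorkerId` and the `Answer.`-prefixed fields) and the toy CSV contents are assumptions made for illustration, and the real results file should be checked for its actual headers.

```python
import csv
import io

def screen_submissions(rows, answer_fields):
    """Split submissions into (approved, rejected).

    `rows` must be in submission order. A submission is rejected if any
    answer field is empty or if the worker has already submitted a survey.
    Assumption: a worker counts as "seen" even if the earlier submission
    was itself rejected.
    """
    seen = set()
    approved, rejected = [], []
    for row in rows:
        worker = row["WorkerId"]
        incomplete = any(not (row.get(f) or "").strip() for f in answer_fields)
        if worker in seen or incomplete:
            rejected.append(row)
        else:
            approved.append(row)
        seen.add(worker)
    return approved, rejected

# Toy results file (fabricated for illustration only).
toy_csv = io.StringIO(
    "WorkerId,Answer.item1,Answer.item2\n"
    "A1,100,50\n"
    "A2,80,\n"
    "A1,90,60\n"
    "A3,120,70\n"
)
rows = list(csv.DictReader(toy_csv))
approved, rejected = screen_submissions(rows, ["Answer.item1", "Answer.item2"])
```

In the toy file, the second row is rejected as incomplete and the third as a repeat submission by worker A1.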
Because acceptability judgments are by definition subjective (there is no external measurement method), there are no universally agreed upon criteria for identifying participants who are not performing the task correctly. One possibility explored by Sprouse and Cunningham (submitted for publication) was to plot the mean ratings of each condition in ascending order and identify a subset of conditions that appear to have a definitive rank order in the sample mean data. The rank order of those items could then be computed for each participant and compared to their rank order in the sample mean data (the “true” ordering) to derive a measure of divergence between each participant’s rank order and the sample rank order. One such measure of rank order comparison is the tau rank correlation (Kendall, 1938). The tau rank correlation is based on Kendall’s tau, which is a distance measure between two rank orders based on how many pairwise “flips” of adjacent numbers are necessary to turn one rank order into another. The tau rank correlation yields a coefficient for each participant between –1 and 1. A perfect match between the two ranks yields a 1, no relation between two ranks yields a 0, and the most dissimilar rank yields a –1. The tau rank correlation coefficients can then be plotted in a histogram to identify any participants whose rank order is qualitatively different from the sample rank order. Crucially, for the purposes of this report, this procedure does not have to be the best possible outlier identification procedure; it merely has to return results that (1) are logically interpretable and (2) allow for a comparison to be made between the two samples.
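To make the measure concrete, here is a minimal Python sketch of the tau coefficient computed from concordant and discordant pairs. It implements the tau-a variant and assumes no tied ratings; the text does not say which variant or tie handling was used, so treat both choices, and the function name, as assumptions.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a between two equally long sequences of scores.

    A pair of items is concordant if both sequences order it the same way
    and discordant if they order it in opposite ways.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Sample mean rank order of eight conditions versus three participants.
sample_order = [1, 2, 3, 4, 5, 6, 7, 8]
same = kendall_tau(sample_order, [1, 2, 3, 4, 5, 6, 7, 8])      # 1.0
reverse = kendall_tau(sample_order, [8, 7, 6, 5, 4, 3, 2, 1])   # -1.0
one_flip = kendall_tau(sample_order, [2, 1, 3, 4, 5, 6, 7, 8])  # 26/28
```

With eight conditions there are 28 pairs, so a single adjacent flip lowers the coefficient from 1 to 26/28, or about .93.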
- (9) Examples of the Eight Conditions Chosen for the Rank Order Analysis
What do you worry if the lawyer forgets at the office?
What does the detective wonder whether Paul took?
The slogan on the poster unsurprisingly were designed to get attention.
The slogan on the posters unsurprisingly were designed to get attention.
Who worries if the lawyer forgets his briefcase at the office?
What does the detective think Paul took?
Who made the claim that Amy stole the pizza?
Who thinks Paul took the necklace?
The tau coefficients for the laboratory sample are much more tightly clustered at the high end of the scale than the AMT sample, which has a much heavier leftward tail. At a practical level, this means that it is much easier to identify outliers in the laboratory sample: the 3 participants with tau coefficients below 0 are obviously distinct from the primary mass of participants. Furthermore, their negative tau coefficients indicate that their rank order was nearly reverse from the sample rank order. The picture is less clear for the AMT sample. A large majority of the participants still have tau coefficients above .5, but there are many more participants with tau coefficients near or below 0, and there is a less clear separation between the primary mass of participants and the potential outliers. Adopting a cutoff criterion similar to the one for the laboratory sample (~.15) results in the elimination of 22 participants from the AMT sample and coincides with a minor mode in the tail of the distribution. The fact that this criterion is difficult to establish without a comparison to the laboratory sample raises a potential problem for the use of this method of participant removal with AMT samples; however, for the purposes of this validation study, it provides us with a conservative estimate that is logically comparable to the laboratory sample.
In total, 25 out of 176 participants (14.2%) were excluded from the AMT sample for either self-identifying as nonnative (3) or providing results in which the rank order differed significantly from the sample rank order (22). Although the AMT rejection rate appears to compare unfavorably with the 3 rejections for the laboratory sample (1.7%), it should be noted that 14.2% is well within the range of rejection rates for other behavioral methodologies such as self-paced reading and lexical decision, and lower than the rejection rates for electrophysiological methodologies such as EEG and MEG. The minor increase in participant rejections in the AMT sample seems to be more than offset by the roughly 44:1 time advantage (88 experimenter hours vs. 2 h). To adjust for this slightly higher rejection rate, syntacticians may want to consider adding 15% to the target sample size (e.g., 35 instead of 30). The statistical analyses presented in the following sections were performed on the remaining 173 participants in the laboratory sample and the remaining 151 participants in the AMT sample.
The primary concern of syntacticians is that the noise introduced by the uncontrolled environment of AMT might lead to lower statistical power than traditional laboratory-based experiments. To investigate this concern empirically, resampling simulations were run on each of the phenomena presented in the Case Studies for Analysis section above. These resampling simulations were designed to estimate the rate of statistical detectability for each phenomenon for every sample size between 5 and 173 for the laboratory sample, and between 5 and 151 for the AMT sample. In other words, these resampling simulations provide an answer to the questions: How likely am I to detect phenomenon X with a sample size of Y in the laboratory? And how likely am I to detect phenomenon X with a sample size of Y with AMT?
1. Choose one of the two samples (laboratory or AMT).
2. Choose a sample size (e.g., 5).
3. Randomly sample (with replacement) a number of participants equal to that size (e.g., 5) from the full data set.
4. Randomly choose one judgment for each condition from each of the participants in the sample.
5. Run a paired t test on the sample.
6. Repeat Steps 3–5 a total of 1,000 times.
7. Calculate the proportion of significant results (p < .05) out of those 1,000 samples; this is an estimate of the detection rate at that sample size.
8. Repeat Steps 2–7 for all of the other possible sample sizes (5–173 for the laboratory sample, 5–151 for the AMT sample).
9. Repeat Steps 2–8 for every possible number of judgments per participant per condition (in this case, 1–4).
10. Repeat Steps 2–9 for the other sample (laboratory or AMT).
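The core of the resampling procedure in the steps above (Steps 3 through 7, for one sample size and one contrast) can be sketched in Python using only the standard library. This is a hedged illustration, not the script used for the article: the data layout, the function names, the averaging of multiple judgments per condition, and the numerical integration used to obtain the t-test p value are all assumptions, and the toy data are fabricated.

```python
import math
import random
from statistics import mean, stdev

def t_two_sided_p(t, df):
    """Two-sided p value for Student's t, by Simpson integration of the density."""
    a = min(abs(t), 100.0)  # the tail beyond 100 is negligible for a .05 decision
    if a == 0:
        return 1.0
    steps = 2 * max(50, math.ceil(a / 0.02))  # even number of intervals
    h = a / steps
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))
    total = pdf(0.0) + pdf(a)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    central = total * h / 3  # P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * central)

def paired_t_p(a, b):
    """p value of a paired t test on two equally long lists of scores."""
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)
    if sd == 0:
        return 0.0 if mean(diffs) != 0 else 1.0
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t_two_sided_p(t, len(diffs) - 1)

def detection_rate(data, cond_a, cond_b, sample_size,
                   n_judgments=1, n_sims=1000, alpha=0.05, rng=None):
    """Proportion of resamples in which the contrast is significant.

    `data[participant][condition]` is a list of z-scored judgments.
    """
    rng = rng or random.Random()
    participants = list(data)
    hits = 0
    for _ in range(n_sims):                                   # Step 6
        sample = rng.choices(participants, k=sample_size)     # Step 3
        # Step 4; averaging when n_judgments > 1 is an assumption.
        a = [mean(rng.sample(data[p][cond_a], n_judgments)) for p in sample]
        b = [mean(rng.sample(data[p][cond_b], n_judgments)) for p in sample]
        if paired_t_p(a, b) < alpha:                          # Step 5
            hits += 1
    return hits / n_sims                                      # Step 7

# Toy data (fabricated): 40 participants, 4 judgments per condition.
rng = random.Random(1)
toy = {p: {"grammatical": [rng.gauss(1, 1) for _ in range(4)],
           "island": [rng.gauss(-1, 1) for _ in range(4)]}
       for p in range(40)}
rate = detection_rate(toy, "grammatical", "island",
                      sample_size=20, n_sims=200, rng=rng)
```

Looping `detection_rate` over sample sizes, over `n_judgments` from 1 to 4, and over the two samples would correspond to Steps 8 through 10.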
Although there does appear to be a slight loss of statistical power in the AMT sample, this difference is relatively small by experimental standards: The AMT sample requires 3 or 4 more participants than the laboratory sample to reach 100% detectability. This suggests that any concern that syntacticians may have about AMT can be alleviated by increasing the sample size slightly. It should also be noted that both the laboratory sample and the AMT sample reached 100% detectability with fewer than 20 participants in the relatively underpowered one-judgment analysis. Given that the standard sample size in formal acceptability judgments is 25–30 and that it is standard to give each participant more than one judgment per condition, it seems unlikely that syntacticians would notice the slight power loss under normal experimental design conditions. In short, these results suggest that AMT is well suited to detect standard syntactic phenomena without any noticeable loss in statistical power.
For the center embedding and agreement attraction effects, the AMT sample once again appears to yield slightly lower detectability rates than the laboratory sample: The AMT sample requires 10 additional participants to reach detectability rates that are comparable to the laboratory sample. This does not appear to pose a significant problem for the use of AMT, given the ease with which an additional 10 participants can be recruited. However, the comparative illusion detection rate in the AMT sample is a potential cause for concern: The AMT sample appears to require 50 additional participants to reach detectability rates that are comparable to the laboratory sample. Given that two of the three extremely weak effects were detected within the AMT sample at rates comparable to the laboratory sample, it seems likely that the lower detection rate for comparative illusions may say more about comparative illusions than it does about the use of AMT. In fact, as we shall see in the next section, the distributions of the comparative illusion data suggest that fewer AMT participants were fooled by the illusion, which suggests that the lower detectability of the effect in the AMT sample may be indicative of more accurate judging by the AMT participants. Taken together with the fact that none of these effects are well suited to investigation using (nonspeeded) acceptability judgments in the first place, these results strongly suggest that syntacticians need not worry about the statistical power of AMT samples for true syntactic phenomena.
The shapes of the distributions
The distributions of the two samples are very similar for each of the conditions constituting the island effects: the peaks (modes) are approximately equal in location and frequency, and the overall shapes and widths of the distributions are approximately equal. It does appear that the rightward tail of the AMT distributions is slightly heavier than the rightward tail of the laboratory distributions, which may account for the marginal power difference between the two samples. But overall, the variation between the distributions appears to be well within the bounds of normal variation between samples.
The participant rejection rate is less than 15%, which is well within the normal bounds for behavioral experiments.
There is no evidence of a meaningful power loss for syntactic phenomena, and only a slight power loss for extremely weak (processing-based) effects.
There is no evidence of meaningful differences in the shapes or locations of the judgment distributions.
The online-only interface means that there is no way to ensure that the participants understand the task. This may contribute to the increased participant rejection rate over laboratory-based experiments.
There is similarly no way to debrief participants after the experiment to identify potential problems with the design, instructions, responses, and so forth. The only option is to include debriefing questions as part of the survey itself, which limits the ability to follow up based on the participant’s responses.
The increased participant rejection rate suggests a need for standard participant rejection criteria. Unfortunately, at present there are no standard participant rejection methods in the acceptability judgment literature.
The HTML foundation of AMT means that audio and visual stimuli may be used instead of text (as long as Web browsers support the multimedia file type). However, Amazon provides no mechanism for uploading multimedia files. Instead, researchers must store the multimedia files on their own Web server and link to the files in the HIT itself. An example template for audio files (an auditory acceptability judgment task) is included on the author’s Web site (see the Supplemental Materials section below).
The AMT system provides no mechanism for the collection of reaction times. The only time recorded by the AMT system is HIT completion time (the time from acceptance of the HIT to submission of the HIT), which can be used for participant rejection. If reaction times are crucial to the acceptability judgment experiment, one could use an independent experimental platform (such as WebExp) and use AMT to recruit participants and direct them to the independent experimental platform.
The AMT system does not include functions to aid in experimental design (as is common in dedicated experimental platforms). For example, AMT cannot automatically randomize the order of presentation in a survey. Instead, the experimenter must create randomized versions of the surveys by hand. If the experimenter does not create a novel randomization for each participant, then several participants will see the same randomization (as in this experiment). This adds some time to the construction phase of the experiment.
At present, the AMT worker pool is primarily composed of residents of the U.S. (46.8%) and residents of India (34%) (Ipeirotis, 2010). The composition of the worker pool is a direct reflection of Amazon’s payment system, which is currently configured to pay in U.S. dollars and Indian rupees only. The composition may change in the future as Amazon’s payment system expands; however, at present the lack of geographic diversity will likely affect the collection rates for languages other than English and Hindi, potentially limiting the benefits of AMT for cross-linguistic studies.
Any questions about native speaker ability should be informational only and, crucially, should not lead to nonpayment. This discourages misrepresentations, so that the answers can be used as participant rejection criteria during data analysis.
Researchers should run some sort of participant rejection or outlier removal process prior to analysis, since the AMT outlier rate is higher than the laboratory rate (14.2% vs. 1.7%).
Target sample sizes should be increased by 15% to accommodate the higher participant rejection rate.
If extremely weak effects are being investigated (i.e., effects that require sample sizes of 100 or more), 10 additional participants should be added to accommodate the slightly lower statistical power of the AMT sample.
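The two sample size recommendations above can be combined into a small planning helper, sketched here in Python. The function name is hypothetical, and applying the 10 extra participants on top of the 15% buffer (rather than instead of it) is an assumption about how the two recommendations interact.

```python
import math

def amt_recruitment_target(target_n, buffer_pct=15,
                           weak_effects=False, weak_effect_extra=10):
    """Number of AMT participants to recruit for a desired final sample size."""
    # Integer arithmetic avoids floating-point surprises before rounding up.
    n = math.ceil(target_n * (100 + buffer_pct) / 100)
    if weak_effects:
        n += weak_effect_extra
    return n

standard = amt_recruitment_target(30)                  # 35
weak = amt_recruitment_target(100, weak_effects=True)  # 125
```

For a standard 30-participant design this reproduces the "35 instead of 30" example given earlier in the article.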
HTML templates for five different acceptability judgment tasks (magnitude estimation, 7-point scale, yes–no, forced choice, and auditory) can be found on the author’s Web site (currently, www.ling.cogsci.uci.edu/~jsprouse/tools/amt/). This page also includes links to R scripts that may aid in the analysis of data collected using AMT and an online tutorial offered by Amazon about using the AMT Web site.
This research was supported in part by National Science Foundation Grant BCS-0843896. I thank Diogo Almeida for helpful comments, Jessamy Norton-Ford for assistance in the early stages of this project, and two anonymous reviewers for their thoughtful comments.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
- Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge: MIT Press.
- Chomsky, N. (1986). Barriers. Cambridge: MIT Press.
- Cowart, W. (1997). Experimental syntax: Applying objective methods to sentence judgments. Thousand Oaks: Sage.
- Frazier, L. (1985). Syntactic complexity. In D. Dowty, L. Karttunen, & A. Zwicky (Eds.), Natural language processing: Psychological, computational and theoretical perspectives (pp. 129–189). Cambridge: Cambridge University Press.
- Gibson, E., & Fedorenko, E. (in press). The need for quantitative methods in syntax. Language and Cognitive Processes.
- Grimshaw, J. (1986). Subjacency and the S/S' parameter. Linguistic Inquiry, 17, 364–369.
- Ipeirotis, P. G. (2010). Demographics of Mechanical Turk. Center for Digital Economy Research Working Papers, 10. Available at http://hdl.handle.net/2451/29585
- Keller, F. (2000). Gradience in grammar: Experimental and computational aspects of degrees of grammaticality. University of Edinburgh: Unpublished doctoral dissertation.
- Kendall, M. (1938). A new measure of rank correlation. Biometrika, 30, 81–89.
- Kuno, S. (1973). Constraints on internal clauses and sentential subjects. Linguistic Inquiry, 4, 363–385.
- Phillips, C., Wagers, M., & Lau, E. (in press). Grammatical illusions and selective fallibility in real-time language comprehension. Language and Linguistics Compass.
- R Development Core Team. (2009). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. Available at www.R-project.org
- Ross, J. R. (1967). Constraints on variables in syntax. Unpublished doctoral dissertation, MIT, Cambridge, MA.
- Schütze, C. (1996). The empirical base of linguistics: Grammaticality judgments and linguistic methodology. Chicago: University of Chicago Press.
- Sprouse, J., & Almeida, D. (2010). A quantitative defense of linguistic methodology. Manuscript submitted for publication.
- Sprouse, J., Wagers, M., & Phillips, C. (2010). A test of the relation between working memory capacity and island effects. Manuscript submitted for publication.