Method
Materials
Three types of letter endings were used in the main experiment: noun suffixes, adjective suffixes, and nonmorphological noun endings. We identified only four nonmorphological adjective endings (–IKE, –LETE, –UL, –UNG) and were therefore unable to include a nonmorphological adjective-ending condition. Two comparisons were planned: (1) noun suffixes versus adjective suffixes, and (2) noun suffixes versus nonmorphological noun endings. Table 1 lists all endings; Table 2 lists descriptive statistics for the psycholinguistic variables. Suffixes and nonmorphological endings (i.e., endings of non-suffixed words) were extracted from CELEX (Baayen et al. 1993). Twenty-five noun suffixes were matched to 25 noun endings on type frequency, diagnosticity, and length in letters (see Table 2). Note that token frequency was not controlled in this experiment, and noun endings were lower in token frequency than noun suffixes (t = 4.64, p < 0.0001). The type diagnosticity measure captures the amount of category information carried by a given spelling: for a particular spelling, diagnosticity is the number of words that end in that spelling and belong to the relevant lexical category, divided by the total number of words that end in that spelling (see Ulicheva et al. 2020, for details). Type frequency is the number of words in CELEX that end in a given letter pattern. For instance, the type frequency of –ER includes pseudoaffixed words such as CORNER as well as morphologically simple words such as ORDER.
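To make the two matching metrics concrete, the following R sketch computes type frequency and type diagnosticity for a given ending. The data frame `celex`, with columns `spelling` and `category`, is a hypothetical stand-in for the CELEX database, and the function name is our own; this illustrates the definitions above rather than reproducing the authors' extraction code.

```r
# Hypothetical input: one row per word, columns `spelling` and `category`.
type_stats <- function(celex, ending, category) {
  has_ending <- grepl(paste0(ending, "$"), celex$spelling, ignore.case = TRUE)
  type_frequency <- sum(has_ending)                          # words ending in the pattern
  in_category    <- sum(has_ending & celex$category == category)
  diagnosticity  <- in_category / type_frequency             # proportion falling in the category
  c(type_frequency = type_frequency, diagnosticity = diagnosticity)
}

# e.g. type_stats(celex, "ness", "noun")
```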
Table 1 Suffixes and endings for the experiments. Noun suffixes printed in italics were removed from the noun suffix/noun ending comparison because matching nonmorphological endings were not available

Table 2 Descriptive statistics for matching variables (no suffix exclusions)

Only 21 nonmorphological endings that could be matched to noun suffixes on frequency and diagnosticity were identified, because nonmorphological endings are typically characterised by substantially lower values on both metrics (see Fig. 1). Thus, four noun suffixes (–ER, –MENT, –NESS, –ISM) for which nonmorphological counterparts were not available were removed from the relevant analyses (see Table 1). Every participant saw each ending four times, except for the 21 lower-frequency adjective suffixes, which appeared eight times each. Note that some of our nonmorphological endings originated from Classical languages, where they functioned as productive morphemes (e.g., –ME as in “morpheme”, “phoneme”, “rhizome”; –M as in “rheum”; –LOGUE as in “analogue”, “catalogue”; Dee 1984). The design yielded 368 items in total. All stimuli, the experimental lists used for presentation, and further details on matching across conditions are available on the project’s OSF storage and can be viewed online (https://osf.io/rbxpn/).
Monosyllabic three- to four-letter nonword stems that ended in a consonant were taken from the ARC nonword database (5942 stems; Rastle et al. 2002). These stems were joined with the endings. Items that formed real words (e.g., lin–EN) or homophones of real words (e.g., /dju–tI/) were filtered out. Further, we removed the following: nonwords containing infrequent bigrams (< 6 instances per million) or trigrams (< 3 instances per million), nonwords that had at least one orthographic neighbour (Coltheart et al. 1977), nonwords with ambiguous endings (e.g., “cli–sy”/“clis–y”), and word-like nonwords (e.g., “briber”, “bonglike”, “lawlist”, “thegent”). A manual pronounceability check was not feasible given the large number of nonwords used in this experiment (40,112). We minimised the possibility that the presence of “odd” nonwords could influence the results by presenting each participant with a unique combination of stems and endings: each participant saw a unique experimental list in which stems were never repeated.
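A minimal sketch of this construction-and-filtering pipeline is given below. All object names (`stems`, `endings`, `lexicon`, `bigram_freq`, `trigram_freq`) are hypothetical stand-ins, and only two of the exclusion steps are shown; this illustrates the logic of the procedure, not the materials code actually used.

```r
# Cross every stem with every ending (hypothetical inputs).
candidates <- expand.grid(stem = stems, ending = endings,
                          stringsAsFactors = FALSE)
candidates$nonword <- paste0(candidates$stem, candidates$ending)

# Remove items that spell a real word (e.g. LIN + EN = "linen").
candidates <- candidates[!(candidates$nonword %in% lexicon), ]

# Remove items containing infrequent bigrams (< 6 per million)
# or trigrams (< 3 per million); `bigram_freq`/`trigram_freq` are
# assumed to be named numeric vectors of frequencies per million.
ngram_ok <- function(w, freqs, n, cutoff) {
  grams <- substring(w, seq_len(nchar(w) - n + 1), seq(n, nchar(w)))
  f <- freqs[grams]
  all(!is.na(f) & f >= cutoff)   # unseen n-grams count as too infrequent
}
keep <- vapply(candidates$nonword, function(w)
  ngram_ok(w, bigram_freq, 2, 6) && ngram_ok(w, trigram_freq, 3, 3),
  logical(1))
candidates <- candidates[keep, ]
```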
Procedure
The experiment was implemented online using the Gorilla Experiment Builder (www.gorilla.sc; Anwyl-Irvine et al. 2019). The task was to decide whether “a letter string looks like a noun or an adjective” by clicking one of two labelled buttons on the computer screen. It was explained that a noun is the name of something such as a person, place, thing, quality, or idea, whereas an adjective is a describing word. Real-word examples were given (“time”, “people”, “way”, “year”; “red”, “simple”, “clever”), and the experiment began with two practice trials involving real words (“lamp”, “colourful”) to ensure that participants understood the task. The experiment did not start until responses on both practice trials were correct. On each experimental trial, participants had eight seconds to respond; otherwise no response was recorded and the software advanced to the next trial automatically. For the final seconds of each trial, a countdown clock was displayed in the upper-right corner of the screen, and a progress bar was displayed in the upper-left corner. Trial order was randomised for each participant, and participants were offered three breaks throughout the experiment. The whole task took, on average, 20 minutes.
Participants
In order to take part in the study, participants had to be right-handed British citizens with no previous history of dyslexia, dyspraxia, ADHD, or any related literacy or language difficulties, raised in a monolingual environment and speaking English as their first language. A total of 109 participants completed the study via Prolific Academic. They were, on average, 24 years old (range: 19–27); 68 were female. One participant indicated that they could also speak French. In terms of education, one participant did not finish high school, 16 finished high school, 40 finished university, three received professional training, and the rest held a graduate degree.
The average reward was £10.35 per hour. Participants read an informed consent form and confirmed that they were willing to take part in the experiment. Since the task was performed online, an extra check was necessary to filter out participants who were not paying attention and/or making little effort to perform well. The main categorisation task did not permit such a judgement, because any response (noun/adjective) was acceptable for any nonword. We therefore used participants’ performance on the spelling task as the exclusion criterion, because in that task the correct response on each trial was known a priori. Altogether, we excluded three participants whose spellings were furthest from the correct spellings (i.e., more than 3 SD away; see Footnote 1). The distance from the correct spelling was estimated using the Levenshtein distance measure (vwr package in R; Keuleers 2013). For instance, one of the excluded participants produced responses like “youfemism”, “apololypse”, and “bueocrat” for “euphemism”, “apocalypse”, and “bureaucrat”, respectively. Data from 105 participants were retained for the analyses.
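The exclusion criterion can be illustrated as follows. The authors used the vwr package (Keuleers 2013); for a self-contained sketch we use base R’s adist(), which also computes Levenshtein distances. The data frame `responses` and its columns are hypothetical.

```r
# Hypothetical input: one row per spelling trial, columns
# `subject`, `target` (correct spelling), and `typed` (response).
responses$dist <- mapply(function(t, s) adist(t, s),
                         responses$target, responses$typed)

# Mean distance per participant, then a 3 SD cutoff across participants.
by_subj  <- aggregate(dist ~ subject, data = responses, FUN = mean)
cutoff   <- mean(by_subj$dist) + 3 * sd(by_subj$dist)
excluded <- by_subj$subject[by_subj$dist > cutoff]
```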
Tasks measuring individual differences
Vocabulary. Participants completed the Vocabulary sub-scale of the Shipley Institute of Living Scale (Shipley 1940). The vocabulary test consisted of 40 items and required participants to select the one word out of four that was most similar in meaning to a prompt word. Response time was unlimited. Vocabulary scores ranged from 14 to 39.
Author recognition. In this test, participants are presented with author names and foils and are asked to indicate which names they recognise as real authors. The test is a reliable predictor of reading skill because author knowledge is thought to be acquired through print exposure (Moore and Gordon 2015; Stanovich and West 1989). The list of 65 existing authors was taken from Acheson et al. (2008). According to an analysis by Moore and Gordon (2015), 15 names from this list elicited minimal variation in responses and thus had no discriminatory power. We therefore replaced these 15 names with names of our own choice, taken from the lists of Pulitzer, Booker, and PEN prize winners between 2001 and 2012. We used the 65 foil names employed by Martin-Chang and Gould (2008). Participants were instructed to avoid guessing, as they would be penalised for incorrect responses. The total score was the number of authors identified correctly minus the number of foils endorsed incorrectly; it ranged from 2 to 49 (out of 65), with a mean of 15.
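In R, this scoring rule amounts to a simple difference; `selected`, `real_authors`, and `foils` are hypothetical character vectors of names.

```r
# ART score: correctly recognised authors minus incorrectly endorsed foils.
art_score <- sum(selected %in% real_authors) - sum(selected %in% foils)
```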
Spelling. Forty words, each eight letters in length and taken from Burt and Tate (2002), were presented for spelling production. Each word’s recording was played first in isolation and then a second time in a sentence. The recordings could be replayed up to 10 times. Participants could type in their spellings once both recordings had stopped playing, and they had 15 seconds to do so. A countdown clock was displayed for the last five seconds of each trial. Spelling scores ranged from 0 to 39 (mean: 15).
Analyses
The analyses were performed using generalized linear mixed-effects models (Baayen et al. 2008) as implemented in the lme4 package (Version 1.1-14; Bates et al. 2015) in the statistical software R (Version 3.6.1; R Development Core Team 2018). First, we present the results of the two planned comparisons: (1) adjective suffixes vs noun suffixes and (2) noun suffixes vs noun endings. A separate generalized linear mixed-effects model was run to analyse each of these contrasts. The models included Response as the dependent variable (a binary categorical variable; Adjective coded as 1, Noun coded as 0), Condition, i.e., ending type (adjective suffix, noun suffix, or noun ending, depending on the comparison), as a fixed factor, and random intercepts for subjects and endings. Second, we report the effects of diagnosticity on participants’ behaviour. Finally, we investigate the sources of individual variation in participants’ sensitivity to these cues.
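For concreteness, one of the planned-comparison models could be specified in lme4 as follows, assuming a hypothetical trial-level data frame `d` with columns `response` (1 = adjective, 0 = noun), `condition`, `subject`, and `ending`. This is a sketch of the model structure described above, not the authors’ analysis script, and the condition labels are our own.

```r
library(lme4)

# Planned comparison 1: adjective suffixes vs noun suffixes.
m1 <- glmer(response ~ condition + (1 | subject) + (1 | ending),
            data   = subset(d, condition %in% c("adjective_suffix",
                                                "noun_suffix")),
            family = binomial)
summary(m1)   # z and p values of the kind reported below come from this summary
```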
Item-based variability
Planned comparisons across ending types (see Footnote 2). The first planned comparison contrasted adjective and noun suffixes. As expected, we observed a significant main effect of condition (z = 5.694, p < 0.001; see Fig. 2), such that more adjective responses were given to nonwords ending in adjective suffixes than to nonwords ending in noun suffixes. The second planned comparison contrasted noun suffixes with nonmorphological noun endings, and here too we observed a significant difference between the conditions: suffixed nonwords elicited fewer noun responses than nonmorphological nonwords (z = −3.132, p < 0.01).
In order to understand potential sources of item-based variability, we examined the relationship between ending diagnosticity and participants’ responses. Three additional statistical models were implemented, one for each ending type (adjective suffix, noun suffix, noun ending; see Footnote 3). The models used the continuous measure of diagnosticity as the only fixed predictor (the dependent variable and the random effects were identical to those in the models described above). Among adjective suffixes, those with higher diagnosticity values appeared more adjective-like to our participants (z = 3.642, p < 0.001; see Fig. 3). The diagnosticity of noun suffixes did not significantly influence responses to nonwords containing these suffixes (z = 0.831, p = 0.406; see Fig. 4). Similarly, we did not observe any effect of the diagnosticity of nonmorphological endings on responses to nonwords (z = −0.983, p = 0.326; see Fig. 5).
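These models differ from the planned comparisons only in restricting the data to one ending type and replacing the fixed factor with the continuous predictor. A sketch for the adjective-suffix model, using the same hypothetical column names as above plus an item-level `diagnosticity` column:

```r
# Diagnosticity model, fitted separately within one ending type.
m_diag <- glmer(response ~ diagnosticity + (1 | subject) + (1 | ending),
                data   = subset(d, condition == "adjective_suffix"),
                family = binomial)
```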
Subject-based variability
In order to address the question of individual differences, we investigated the relationship between participants’ performance on the background language and literacy measures and their nonword classification performance. Generalized linear mixed models included an interaction between ending-type condition and participants’ scores on the language tasks. Three separate models were implemented because of the high correlations between the individual-difference measures (see Table 3) and the problems that such collinearity poses for the interpretability of individual effects (Belsley et al. 2005). The dependent variable and the structure of the random effects were identical to those in the analyses reported above. Participants’ responses aligned with the predicted lexical category more closely when participants were better spellers (adjective vs noun suffixes: z = 19.814, p < 0.001; noun suffixes vs noun endings: z = −5.189, p < 0.001), had better vocabulary (adjective vs noun suffixes: z = 16.619, p < 0.001; noun suffixes vs noun endings: z = −7.630, p < 0.001), or had higher author recognition scores (adjective vs noun suffixes: z = 11.787, p < 0.001; noun suffixes vs noun endings: z = −3.306, p < 0.001).
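A sketch of one of the three individual-differences models (here, spelling), with the same hypothetical column names plus a participant-level `spelling` score; in practice one would likely centre or scale the score before fitting.

```r
# Individual-differences model: ending type by spelling-score interaction.
m_spell <- glmer(response ~ condition * spelling
                            + (1 | subject) + (1 | ending),
                 data   = subset(d, condition %in% c("adjective_suffix",
                                                     "noun_suffix")),
                 family = binomial)
```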
Table 3 Correlation matrix reflecting the relationships between participants’ performance on the language tasks (ART, vocabulary, and spelling). Spelling scores were sign-transformed for interpretability, so that higher values on all variables reflect better performance

Discussion
Experiment 1 replicated earlier findings (Ulicheva et al. 2020). Nonwords with adjective-biasing endings were categorised as adjectives more frequently than as nouns. The effects of suffix diagnosticity were graded, such that the number of adjective responses increased with increasing diagnosticity. We interpret these graded effects as evidence for a statistical learning mechanism involved in assimilating these spelling-to-meaning regularities. This conclusion is strengthened by the relationship between participants’ performance on the language and literacy tests and their sensitivity to category information.
In this experiment, noun endings functioned as stronger cues to lexical category than noun suffixes did. Although these conditions differed in token frequency, as we discovered post hoc (see Materials), we think it unlikely that token frequency is responsible for the observed differences: noun endings were lower in token frequency than noun suffixes, and as such they should have been weaker, not stronger, cues to category. We discuss this finding further in the General Discussion.
One potential flaw of the present experiment is that it involved metalinguistic judgements about lexical category. This is problematic for at least two reasons. Firstly, participants may not have received adequate training to fully grasp the nuanced distinction between adjectives and nouns. Secondly, the requirement to make metalinguistic judgements could have biased participants to pay attention to lexical category cues. Therefore, in Experiment 2 we opted for an implicit task that is less prone to these metacognitive influences.