We used Python as the primary scripting language to generate Polish pseudowords. The process of pseudoword generation is presented in Fig. 1 and outlined in more detail in the subsections “Stimuli Generation and Selection” for Studies 1 and 2. We conducted two studies in which the stimuli were rated in online surveys, which allowed us to cross-check machine-based and human evaluations of the stimuli. This resulted in two sets of pseudowords containing information on their grammatical properties (e.g., grammatical class, grammatical gender of pseudonouns) and psycholinguistically relevant properties (e.g., number of letters, number of syllables, and OLD20). To address the problem of the probability of letter-cluster occurrences at the beginning of the word (Hawelka et al., 2013), we also provided the first-syllable frequency for each pseudoword, which we extracted from the Polish corpus (Lexical Computing CZ s.r.o., 2015; Jakubíček et al., 2013; Suchomel & Pomikálek 2012). The supplementary materials also contain the results of the rating studies (e.g., the percentage of correct identifications of grammatical class, and the mean and standard deviation of ratings of similarity to a real word).
To select the most characteristic grammatical endings for each of the chosen grammatical classes, we used data from the most common Polish language dictionaries, which contain summaries of morphological features embedded in nouns and verbs (Drabik et al., 2018; Dunaj, 2012; Sobol 2002). We extracted specific exponents for feminine, masculine, and neuter nouns in their nominative form and verbs in the infinitive form for Stimuli Set 1 and kept the last syllable of the real word in Stimuli Set 2.
Importantly, the proposed method of pseudoword stimuli generation can be extended to any other grammatical category and thus offers a broader framework for pseudoword construction in all languages that rely on words’ morphological features to signify their grammatical properties. This capability also extends beyond grammatical categories to all grammatical, syntactic, or other features that go beyond semantics and are indicated by a word’s morphology in a given language (e.g., gender in nouns and tense in verbs). Researchers should verify that such an indication is unambiguous for the language in which they plan to conduct research. For example, in Polish, the inflection of a verb can correspond to the gender of the agent in the sentence, and feminine nouns and several masculine nouns use the same suffixes: ‘Koleżanka (f.)/Kolega (m.) otwiera drzwi.’ (‘A colleague opens the door’). Thus, pseudowords pertaining to any of these categories would not be unambiguously distinguishable. For that reason, we abstained from generating this type of stimuli in our example pseudoword sets: we used only infinitive verbs and excluded masculine nouns that have suffixes identical to those of feminine nouns. The scripts used to generate the two initial sets of pseudowords for this paper are available as supplemental online material (SOM) at https://gin.g-node.org/SocGramLab/pseudo-word-stimuli-with-arbitrary-constrains.git.
Study 1: Pseudowords with Different Stems
Stimuli Generation and Selection
We extracted the most frequently used Polish words using the Sketch Engine (https://app.sketchengine.eu) framework (Kilgarriff et al., 2004). Specifically, we used the Wordlist tool together with the Polish web corpus (Lexical Computing CZ s.r.o., 2015; Jakubíček et al., 2013; Suchomel & Pomikálek 2012). This corpus comprises more than 7 billion words in 22 million online documents that were crawled in June 2012 to obtain over 36 million unique word forms. As this was the primary data input for Study 1, we selected 1,000 of the most frequent verbs and 1,000 of the most frequent nouns.
The initial word lists were filtered using the following procedure. First, we removed duplicate entries and word forms that contained any characters not in the standard Polish alphabet (i.e., word forms that were included in the corpus due to errors and defective procedures, such as optical character recognition). Next, to obtain additional characteristics of each word, we used the Grammatical Dictionary of Polish (Saloni et al., 2015), which includes a Morfeusz 2 inflectional analyzer and generator (Kieraś & Woliński, 2017). For verbs, we selected only infinitives. For nouns, only the nominative forms of singular masculine, feminine, and neuter forms were preserved (further classification details, including lists of suffixes used to assign these categories, are available in the SOM). Furthermore, we kept only nouns classified as common nouns and disregarded any words that were classified as belonging to a specialized domain, such as music (Latin-based, e.g., allegro, or “fast music tempo”), linguistics (relating to metalinguistics, e.g., przypadek, or “case”), or archaic (rarely used in modern Polish, e.g., dziewierz, or “brother-in-law”).
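The first filtering step (removing duplicates and forms with non-Polish characters) can be sketched as follows. This is a minimal illustration rather than the script from the SOM: the function name, the lowercasing step, and the sample word forms are assumptions, and the alphabet is taken to be the 32 standard Polish letters (i.e., without q, v, and x).

```python
# The 32 letters of the standard Polish alphabet (assumed here to exclude q, v, x).
POLISH_ALPHABET = set("aąbcćdeęfghijklłmnńoóprsśtuwyzźż")


def clean_word_list(word_forms):
    """Drop duplicates and forms containing characters outside the Polish alphabet.

    Lowercasing before comparison is an assumption made for this sketch.
    """
    seen = set()
    kept = []
    for form in word_forms:
        word = form.lower()
        if word in seen:
            continue  # duplicate entry
        if word and all(ch in POLISH_ALPHABET for ch in word):
            seen.add(word)
            kept.append(word)
    return kept


# Hypothetical raw entries, including typical corpus noise.
raw = ["dom", "Dom", "robić", "caf3", "żółć", "e-mail", "vox"]
cleaned = clean_word_list(raw)
print(cleaned)
```

The morphological filtering that follows (infinitives, nominative singular common nouns) depends on the Grammatical Dictionary of Polish and is not reproduced here.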
Next, words were hyphenated according to Polish syllabification rules using Pyphen (Kozea Community, 2018). We set the language dictionary to “pl_PL” and the minimum number of characters of the first and last syllable to 1. We eliminated words that had fewer than two syllables to ensure that each word clearly indicated its grammatical class with a suffix in the last syllable. We also eliminated words that had more than four syllables to avoid obtaining pseudowords of extremely different lengths.
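The syllable-count criterion can be illustrated with a small sketch. To keep the example self-contained, the hyphenator is passed in as a function and a hand-written lookup table stands in for it; the table entries are illustrative hyphenations, not Pyphen output, and the function name is hypothetical.

```python
def filter_by_syllable_count(words, hyphenate, min_syllables=2, max_syllables=4):
    """Keep words whose hyphenated form has between 2 and 4 syllables.

    `hyphenate` is any callable that returns a string with syllables
    separated by "-", e.g. "ro-bić".
    """
    kept = {}
    for word in words:
        syllables = hyphenate(word).split("-")
        if min_syllables <= len(syllables) <= max_syllables:
            kept[word] = syllables
    return kept


# Stand-in hyphenations (assumed for illustration only).
TOY_HYPHENATION = {
    "dom": "dom",
    "robić": "ro-bić",
    "samochód": "sa-mo-chód",
    "niebezpieczeństwo": "nie-bez-pie-czeń-stwo",
}

kept = filter_by_syllable_count(TOY_HYPHENATION, TOY_HYPHENATION.get)
print(kept)
```

With Pyphen installed, `pyphen.Pyphen(lang="pl_PL", left=1, right=1).inserted` could be passed as `hyphenate`, which corresponds to the settings described above.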
To ensure that the input nouns and verbs were matched with respect to their usage frequency in real language, we applied the Hungarian algorithm to create pairs of nouns and verbs with the closest frequency in the corpus (Kuhn, 1955). In short, for a given 2D array containing some weights (interpreted as the penalties or costs of assigning items that typically belong to two distinct categories to pairs), this algorithm solves the assignment problem in which the sum of weights (i.e., total penalty/cost) is minimized. In our case, the array represented the difference between the frequencies of verbs and nouns in the corpus. This resulted in a collection of source nouns and verbs with the most closely matched frequencies. The average absolute frequency difference for all pairs, normalized by the average frequency of the items in the pair, was 0.08 with a standard deviation of 0.27 (p.d.u. based on the difference between words with respect to their absolute frequency, which is a Sketch Engine metric defined as the direct count of how many times a given item was found in the corpus).
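To make the matching step concrete, the sketch below solves the assignment problem for a small cost matrix of absolute frequency differences. It is a compact pure-Python version of the Hungarian (Kuhn-Munkres) method written for illustration, not the code from the SOM; the words and frequency counts are invented. In practice, a library routine such as SciPy's `linear_sum_assignment` could serve the same purpose.

```python
def hungarian(cost):
    """Return (row, column) pairs that minimize total cost for a square matrix."""
    n = len(cost)
    assert all(len(row) == n for row in cost), "square matrix expected"
    INF = float("inf")
    u = [0.0] * (n + 1)      # row potentials
    v = [0.0] * (n + 1)      # column potentials
    match = [0] * (n + 1)    # match[j] = row (1-based) assigned to column j
    way = [0] * (n + 1)      # previous column on the augmenting path
    for i in range(1, n + 1):
        match[0] = i
        j0 = 0
        minv = [INF] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = match[j0], INF, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[match[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if match[j0] == 0:   # reached a free column
                break
        while j0:                # flip assignments along the augmenting path
            j1 = way[j0]
            match[j0] = match[j1]
            j0 = j1
    return sorted((match[j] - 1, j - 1) for j in range(1, n + 1))


# Invented corpus frequencies, for illustration only.
nouns = [("dom", 100), ("czas", 50), ("pies", 10)]
verbs = [("mieć", 12), ("robić", 95), ("iść", 55)]

# Cost of pairing noun i with verb j = absolute frequency difference.
cost = [[abs(nf - vf) for _, vf in verbs] for _, nf in nouns]
pairs = [(nouns[r][0], verbs[c][0]) for r, c in hungarian(cost)]
print(pairs)
```

Each noun is paired with exactly one verb, and the sum of frequency differences over all pairs is as small as possible, which is the property the matching step relies on.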
Next, the word pairs were entered into the Wuggy program (Keuleers & Brysbaert, 2010) with the Polish module, which generated pseudowords based on the provided real words without altering the suffix that denotes grammatical class. To preserve the suffix of the original source word, we used Wuggy’s Matching Expression feature, which requires generated pseudowords to match a provided regular expression (Keuleers & Brysbaert, 2010; further information about regular expressions is widely available online; for a technical description, refer to IEEE et al., 2018). For example, using the source word “verbifying” and the regular expression “.+ing$” would result in Wuggy generating only pseudowords based on the source word and ending in –ing. In our pipeline, for each real word used as a source of pseudowords, we used a regular expression that required generated pseudowords to retain the ultimate syllable of the source word. For the obtained pseudonouns and pseudoverbs, we calculated the frequencies of their initial syllables (all but the last one), and we verified that these two groups of stimuli were similar in terms of their syllable frequency distributions (a Kolmogorov–Smirnov test was nonsignificant, suggesting that the two independent samples were drawn from the same distribution). We repeated this test for the first syllables only; it was also nonsignificant, suggesting that our pseudonouns and pseudoverbs did not differ with respect to the frequency of their first syllables. We obtained a total of 640 stimuli (320 pseudonouns, e.g., syjcol, setylda, and 320 pseudoverbs, e.g., chlocić, osordać), each of which contains six to eight letters. For these pseudowords, we computed OLD20 values with reference to the Grammatical Dictionary of Polish (Saloni et al., 2015).
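The construction of such a suffix-preserving expression can be sketched with Python's `re` module. The example only illustrates the pattern itself; how the expression is passed to Wuggy is not shown, and the function name, source word, and candidate strings are assumptions made for this sketch.

```python
import re


def last_syllable_pattern(hyphenated_word):
    """Build a regular expression that requires the source word's last syllable.

    `hyphenated_word` uses "-" between syllables, e.g. "ro-bić".
    The pattern has the same shape as the ".+ing$" example in the text.
    """
    last = hyphenated_word.split("-")[-1]
    return re.compile(r".+" + re.escape(last) + r"$")


pattern = last_syllable_pattern("ro-bić")

# Hypothetical candidate strings; only those ending in "bić" and having
# at least one character before it should pass.
candidates = ["gobić", "robać", "lubić", "bić"]
matching = [c for c in candidates if pattern.match(c)]
print(matching)
```

Escaping the syllable with `re.escape` is a precaution for the general case; Polish letters themselves carry no special meaning in regular expressions.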
Stimuli Evaluation Procedure
The goal of this procedure was to test the pseudowords with different stems in terms of the association of OLD20 values with people’s perceptions of the pseudowords and to identify pseudowords whose form clearly indicated that they were verbs or nouns. Participants were invited to take part in an online study and were randomly assigned to one of 16 lists that presented 40 pseudowords in a fixed random order. Before the stimuli evaluation task, participants were asked to provide basic demographic information, such as age, gender, education, foreign language knowledge, and diagnosed language processing disorders, such as dyslexia or other language impairments. We also asked whether Polish was their native language.
For each pseudoword, we asked two questions. The first concerned the similarity of the presented pseudoword to any real word in Polish (e.g., “To what extent does the word poceda resemble a word that exists in Polish?”), with answers ranging from 0 (not at all) to 4 (very much). The second question concerned the grammatical class of the presented pseudoword (e.g., “Which grammatical class might the word poceda be?”). Participants had to select one of the presented options: “verb,” “noun,” “adjective,” “other,” or “I don’t know.”
A total of 328 people participated in this study (247 women, 76 men, 4 people who refused to indicate their gender, and 1 person who declared their gender to be “other”; Mage = 31.06 years, SDage = 8.58 years). Twenty-four people were excluded from the data analysis due to their declaration of diagnosed language processing disorders, and one person was excluded because their declared native language was not Polish. We based the final evaluation of pseudowords in this study on ratings from 303 people.
Resulting Word List for Study 1
In order to choose stimuli that were most clearly perceived as verbs or nouns, we applied the following criteria. First, the accuracy of grammatical class identification had to be above 80% to ensure that words were commonly recognized as either verbs or nouns. Moreover, we only chose pseudowords that were rated significantly less than 3 on the scale measuring similarity to Polish words to ensure that the pseudowords were not too similar to any existing word. The final dataset of pseudowords contained 26 feminine, 79 masculine, and 12 neuter nouns as well as 41 verbs in the infinitive form. A summary of the basic properties of the words is presented in Table 1, and the dataset containing all stimuli and those selected for our list is available in the SOM.
The results of a Spearman correlation indicated a nonsignificant relationship between OLD20 values and average human ratings of pseudoword similarity to real words for pseudonouns (r(115) = 0.12, p = .21) and a nonsignificant relationship between OLD20 values and human ratings for pseudoverbs (r(39) = −0.23, p = .14). These findings indicate that the OLD20 measure of similarity to existing words and human subjects’ judgments of pseudowords’ similarity to real words are independent. Furthermore, even though the two correlations were not significant, the correlation between ratings and OLD20 values was positive for pseudonouns and negative for pseudoverbs. Therefore, we applied Fisher’s Z-transformation to test whether the two correlation coefficients differed significantly. Importantly, the comparison indicated that the two correlation coefficients did not differ significantly, z = −1.89, p = .06. Therefore, we can conclude that the pseudonouns and pseudoverbs are similarly unrelated to OLD20 values.
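The comparison of two independent correlation coefficients can be sketched as follows. This is the textbook Fisher Z procedure, offered as an illustration under the assumption that it corresponds to the comparison reported above; the sample sizes are inferred from the reported degrees of freedom (n = df + 2), and the function name is hypothetical.

```python
import math


def compare_independent_correlations(r1, n1, r2, n2):
    """Compare two correlations from independent samples via Fisher's Z.

    Returns the z statistic and the two-tailed p value.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)          # Fisher transformation
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # SE of the difference
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))           # two-tailed normal p
    return z, p


# Pseudoverbs: r(39) = -0.23, so n = 41; pseudonouns: r(115) = 0.12, so n = 117.
z, p = compare_independent_correlations(-0.23, 41, 0.12, 117)
print(f"z = {z:.2f}, p = {p:.2f}")
```

With the rounded coefficients given in the text, this yields approximately z = −1.89 and p = .06, in line with the reported comparison.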
Study 2: Pseudowords with Shared Stem
Stimuli Generation and Selection
We extracted the most frequently used Polish words using the Sketch Engine (https://app.sketchengine.eu) framework (Kilgarriff et al., 2004). As in Study 1, we used the Wordlist tool together with the Polish web corpus (Lexical Computing CZ s.r.o., 2015; Jakubíček et al., 2013; Suchomel & Pomikálek 2012). For Study 2, however, we selected the 20,000 most frequent words as the primary data input (initial corpus).
Our aim in this study was to obtain pairs that consist of pseudoverbs and pseudonouns that share a common stem but differ with respect to their suffixes (which enable the distinction between grammatical classes). To achieve this goal, we developed a procedure for plausible last-syllable substitution that resulted in the generation of pseudoword pairs with a common stem. The initial steps of the procedure employed in this study were analogous to the steps used in the first study. However, here, we did not use the Hungarian algorithm to select verb–noun pairs of the closest corpus frequency but instead used (as an input to the Wuggy program) all nouns and verbs that were identified in the initial corpus and met the same inclusion criteria as in Study 1. As in Study 1, we only kept words consisting of standard Polish alphabet characters, namely verbs in the infinitive form and common nouns in the singular nominative masculine, feminine, and neuter forms. Next, based on the pseudowords generated by Wuggy, we constructed a list of all penultimate syllables present in the Wuggy output and, for each penultimate syllable, a list of all plausible ultimate syllables that could follow it according to that output. This resulted in an exhaustive list of all orthographically and phonologically plausible ultimate-syllable substitutions conditioned on the penultimate syllable present in the Wuggy output. All ultimate syllables of the obtained pseudowords were possible inflectional noun/verb endings because, as in Study 1, Wuggy was configured to retain the last syllable of the source word in the generated pseudowords. Each ultimate syllable clearly indicated the grammatical class of the pseudoword that it terminated.
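The list of plausible ultimate syllables for each penultimate syllable can be represented as a simple mapping. The sketch below is an illustration only: the function name is hypothetical, and the hyphenated items are hand-made stand-ins for Wuggy output rather than actual generated stimuli.

```python
from collections import defaultdict


def build_ending_map(hyphenated_pseudowords):
    """Map each penultimate syllable to the set of ultimate syllables
    observed after it in the given (hyphenated) pseudowords."""
    endings = defaultdict(set)
    for item in hyphenated_pseudowords:
        syllables = item.split("-")
        if len(syllables) >= 2:
            endings[syllables[-2]].add(syllables[-1])
    return endings


# Hand-hyphenated stand-ins for generator output (assumed for illustration).
wuggy_like_output = ["gro-cu-nek", "mo-cu-kać", "ro-ce-zja", "po-ce-rać"]
ending_map = build_ending_map(wuggy_like_output)

for penultimate, ultimates in sorted(ending_map.items()):
    print(penultimate, sorted(ultimates))
```

Because the mapping is built only from syllable sequences that actually occur in the generator output, every substitution drawn from it is attested after that penultimate syllable.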
Based on these lists, more pseudowords were generated by means of last-syllable substitution (all possible substitutions were used, conditioned only on the penultimate syllable). This approach minimized the probability of accidentally introducing impossible or very rare suffixes into the generated pseudowords. Since we aimed to produce pseudoverb–pseudonoun pairs, in all further processing we considered only pseudowords whose stems had been assigned (jointly) at least one noun-indicating and at least one verb-indicating suffix (i.e., we removed pseudowords with a stem that was present in only one grammatical category). To further ensure the orthographic and phonological plausibility of the pseudowords, we removed words that contained bigrams that are extremely rare or impossible in Polish, for example, “ff” or “yx” (for the full list, please refer to the SOM). Using the Grammatical Dictionary of Polish (Saloni et al., 2015), we also removed any real words that could have resulted from the syllable substitution procedure. Furthermore, we only kept words containing 6–8 characters. For the obtained pseudowords, the OLD20 metric was computed with reference to the Grammatical Dictionary of Polish (Saloni et al., 2015), and to ensure that the resulting pseudowords were orthographically and phonologically plausible but not too similar to real words, we removed any pseudowords that had an OLD20 value below 2.0 or above 3.5. Additionally, because we intended to produce pairs of stimuli that contained items with comparable properties, after the above filtering based on OLD20, we only kept the pseudonoun and pseudoverb sets that shared a common stem. In other words, a pseudonoun was removed from the set if no pseudoverb shared a stem with it, and vice versa.
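The substitution and stem-filtering steps can be sketched as follows. This is a simplified illustration, not the SOM script: the ending map and input items are invented, only the two bigrams named in the text are screened, and grammatical class is assigned with a toy rule (endings in “ć” count as verb-indicating), whereas the actual suffix lists are given in the SOM.

```python
from collections import defaultdict

# Invented mapping from penultimate syllable to plausible ultimate syllables.
ENDING_MAP = {"cu": {"nek", "kać"}, "ce": {"zja", "rać"}, "ty": {"ka"}}

# Only the two examples named in the text; the full list is in the SOM.
RARE_BIGRAMS = {"ff", "yx"}


def is_verb_ending(ending):
    """Toy classification rule (assumption): infinitives end in 'ć'."""
    return ending.endswith("ć")


def expand_by_substitution(hyphenated_words, ending_map):
    """Substitute every plausible last syllable and keep only stems that
    received at least one noun-indicating and one verb-indicating ending."""
    by_stem = defaultdict(lambda: {"noun": set(), "verb": set()})
    for item in hyphenated_words:
        syllables = item.split("-")
        stem = "".join(syllables[:-1])
        for ending in ending_map.get(syllables[-2], ()):
            word = stem + ending
            if any(bigram in word for bigram in RARE_BIGRAMS):
                continue  # implausible letter sequence
            cls = "verb" if is_verb_ending(ending) else "noun"
            by_stem[stem][cls].add(word)
    return {s: f for s, f in by_stem.items() if f["noun"] and f["verb"]}


paired = expand_by_substitution(["gro-cu-nek", "se-ty-ka"], ENDING_MAP)
print(paired)
```

The stem of the second item is dropped because its only available ending is noun-indicating, which mirrors the removal of stems present in only one grammatical category.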
For the same reason, we only kept pseudonouns and pseudoverbs that could form common-stem pairs in which the two items had the same number of characters and differed in OLD20 (computed with reference to the Grammatical Dictionary of Polish; Saloni et al., 2015) by less than 0.5. Furthermore, based on the noun suffixes, we assigned grammatical genders to the obtained pseudonouns and only kept words whose suffixes allowed for unambiguous grammatical gender classification. At this stage, we obtained 385 pseudonouns (121 masculine, 203 feminine, and 61 neuter). Next, we randomly selected 50 nouns for each grammatical gender category. The obtained 150 pseudonouns were used to select 265 pseudoverbs with which they shared a common stem. Altogether, these constituted a set of 415 pseudowords (e.g., grocunek, grocukać, rocezja, rocerać) that were evaluated by human subjects. Some pseudoverbs shared stems with more than one pseudonoun (e.g., mocejać, mocetła, mocezja), and some pseudonouns shared stems with more than one pseudoverb (e.g., rorekia, rorerać, roregać).
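The OLD20-based pairing constraints can be sketched in pure Python. OLD20 is computed here as the mean Levenshtein distance from a word to its 20 nearest neighbours in a reference lexicon; the function names and the tiny lexicon are assumptions for illustration, and when the lexicon has fewer than n entries the sketch simply averages over all of them.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def old_n(word, lexicon, n=20):
    """Mean Levenshtein distance to the n closest lexicon entries."""
    nearest = sorted(levenshtein(word, entry) for entry in lexicon)[:n]
    return sum(nearest) / len(nearest)


def compatible_pair(noun, verb, lexicon, max_old_diff=0.5, n=20):
    """Same number of characters and OLD difference below the threshold."""
    if len(noun) != len(verb):
        return False
    return abs(old_n(noun, lexicon, n) - old_n(verb, lexicon, n)) < max_old_diff


# Tiny stand-in lexicon; the study used the Grammatical Dictionary of Polish.
toy_lexicon = ["kat", "koty", "dom"]
print(old_n("kot", toy_lexicon, n=2))
print(compatible_pair("grocunek", "rocerać", toy_lexicon))
```

A realistic run would pass the full dictionary word list as `lexicon` and keep the default `n=20`; the quadratic-time distance function above is adequate for a sketch but slow for large lexicons.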
Stimuli Evaluation Procedure
The goal of this procedure, analogous to that in Study 1, was to test the stimuli in terms of association of OLD20 values with people’s pseudoword perceptions and extraction of grammatical class clearly indicative of a pseudoword being a pseudoverb or pseudonoun. In this study, however, we used stimuli that shared a stem to more strongly emphasize grammatical class differences in the last syllable. Participants were invited to take part in an online study and were randomly assigned to one of 12 lists that presented 35 pseudowords in a fixed random order. As in Study 1, before the stimuli evaluation task, participants were asked to provide basic demographic information, including age, gender, education, foreign language knowledge, and diagnosed language processing disorders, such as dyslexia or other language impairments. We also asked whether Polish was their native language. Three questions were asked for each pseudoword. The first concerned the possibility that a word could be a real Polish word (i.e., whether a pseudoword was constructed in line with Polish lexical rules; “Estimate probability in which poceda can be a Polish word”). The second question asked about the similarity of the presented pseudoword to any real Polish word (e.g., “To what extent does the word poceda resemble a word that exists in Polish?”). For both of these questions, answers ranged from 0 (not at all) to 4 (very much). The third question concerned the grammatical class of the presented pseudoword (e.g., “Which grammatical class might the word poceda be?”). Participants had to select one of the presented options: “verb,” “noun,” “adjective,” “other,” or “I don’t know.” Finally, we asked how carefully the participants filled out the questionnaire, with answers ranging from 0 (careless) to 3 (very carefully).
A total of 312 people participated in this study (228 women, 80 men, 2 people who refused to indicate their gender, and 2 people who declared their gender to be “other”; Mage = 31.5 years, SDage = 9.51 years). We excluded 22 people from the data analysis due to their declaration of diagnosed language processing disorders and four people based on their self-evaluated low attention (0, or careless). We thus based our evaluation of the pseudowords with a shared stem on the ratings of 286 people.
Resulting Word List for Study 2
In order to select stimuli, we applied the following criteria. First, as in Study 1, the accuracy of grammatical class identification had to be above 80% to ensure that words were commonly recognized as either pseudoverbs or pseudonouns. Moreover, we only chose pseudowords that were considered possible in Polish (ratings significantly higher than 3), as this indicated phonological acceptance. Furthermore, we only chose words that were rated as significantly less than 3 on the scale measuring similarity to Polish words, as this ensured that the pseudowords were not too similar to any existing word. In addition, the OLD20 measure (M = 2.81 for pseudonouns and M = 2.84 for pseudoverbs) was used to select pseudowords that were not too similar to or too distant from the real words on which they were based and to construct pseudoword noun–verb pairs with similar OLD20 values. Next, using the Hungarian algorithm, we paired pseudowords to establish noun–verb pairs with a shared stem and the most similar properties. This resulted in a final dataset of pseudowords containing 121 pseudonoun and pseudoverb pairs constructed from 93 pseudoverbs and 47 pseudonouns. Pairs were not exclusive with respect to stem (one pseudoverb could share a stem with more than one pseudonoun and one pseudonoun could share stem with more than one pseudoverb). A summary of the basic properties of the pseudowords set is presented in Table 2, and the dataset, including all stimuli and those selected for our list, is available in the SOM.
The Spearman correlation between the similarity of pseudowords to real Polish words and the OLD20 measure was significant for pseudoverbs (r(91) = −0.31, p < .05) and nonsignificant for pseudonouns (r(45) = −0.23, p = .13). In the case of pseudoverbs, the significant correlation was negative and low; that is, the higher the rated resemblance to a real Polish word, the lower the OLD20 value. It has to be noted, however, that the sample size for pseudoverbs was larger than that for pseudonouns, which likely drove the significance of the correlation coefficient. When we compared the two correlation coefficients using Fisher’s Z-transformation for independent samples, the result indicated a nonsignificant difference: z = −0.47, p = .64. This suggests that pseudonouns and pseudoverbs are similarly unrelated to OLD20 (as in Study 1) and that the significance of the correlation between pseudoverbs and OLD20 is likely an artifact.
The Spearman correlations between OLD20 values and the rated possibility that the presented pseudoword is a Polish word were calculated separately for the pseudonoun and pseudoverb in each pair. The results indicated a nonsignificant relationship between OLD20 values and the possibility that the presented pseudoword could be a Polish word for both pseudoverbs (r(91) = −0.10, p = .36) and pseudonouns (r(45) = −0.11, p = .46).
Additionally, the z-score of the comparison between the correlation coefficients for pseudoverbs and pseudonouns in relation to OLD20 indicated that the two correlation coefficients were similar: z = 0.05, p = .96.