Method
Participants
All participants in the surveys were native speakers of Polish, recruited through online social media, research mailing lists, and language forums. Each survey took less than 15 min to complete. Raters who failed to complete the entire survey were excluded from the analyses. Cloze probability tests were completed by 140 participants (mean age = 22.83, SD = 2.52; 123 females), meaningfulness ratings by 132 participants (mean age = 21.8, SD = 2.4; 115 females), familiarity ratings by 101 participants (mean age = 22.33, SD = 2.37; 80 females), and metaphoricity ratings by 102 participants (mean age = 23.32, SD = 2.4; 87 females).
Materials and Design
Materials used in the ratings included 120 novel nominal metaphors (e.g., Blizny to pamiętnik; Eng: Scars are a diary), 120 novel similes (e.g., Blizny są jak pamiętnik; Eng: Scars are like a diary), 120 literal sentences (e.g., Ten egzemplarz to pamiętnik; Eng: This copy is a diary), and 120 anomalous sentences (e.g., Ten młot jest jak pamiętnik; Eng: This hammer is like a diary). The sentences in each set shared the same sentence-final (critical) word, which was always a concrete noun. Novel nominal metaphors and novel similes shared the same target and source domain and differed only in their syntactic structure (i.e., A to B; Eng: A is B vs. A jest jak B; Eng: A is like B). The critical words were controlled for frequency per million (M = 4.68, SD = .45, range 2.47–4.7), number of syllables (M = 2.34, SD = .48, range 2–3), and number of letters (M = 6.57, SD = 1.45, range 4–11). Frequency values were calculated using the SUBTLEX-PL corpus (Mandera et al. 2014). Sentence length ranged from 3 to 5 words; novel nominal metaphors: M = 3.28, SD = .50, novel similes: M = 4.28, SD = .48, literal sentences: M = 4.00, SD = .18, and anomalous sentences: M = 3.71, SD = .57. The list of Polish stimuli is provided in Appendix 1.
Procedure
All of the normative studies were conducted using cloud-based online survey software for designing web-based surveys and collecting responses. For the normative tests, the materials were divided into four (meaningfulness ratings) or three (cloze probability tests, familiarity, and metaphoricity ratings) blocks so as to avoid repeating the critical word within one block. While meaningfulness ratings included all types of utterances (i.e., novel nominal metaphors, novel similes, literal, and anomalous sentences), normative tests on cloze probability, familiarity, and metaphoricity involved meaningful stimuli only (i.e., novel nominal metaphors, novel similes, and literal utterances).
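One way such a division could be achieved for the meaningfulness ratings is a Latin-square style rotation of sentence types across blocks. The Python sketch below is an illustration only: the source does not state how the blocks were built, and the condition labels and function name are hypothetical. It shows that rotating the four sentence types over 120 sets yields four blocks of 120 sentences in which each critical word occurs once and each sentence type occurs 30 times.

```python
# Hypothetical labels for the four sentence types; 120 sets as in the Materials.
CONDITIONS = ["metaphor", "simile", "literal", "anomalous"]
N_SETS = 120  # each set shares one sentence-final critical word


def assign_blocks(n_sets, conditions):
    """Rotate conditions across blocks (Latin-square style).

    Returns one list per block; each entry is a (set_index, condition) pair,
    and every set (critical word) appears exactly once in every block.
    """
    n_blocks = len(conditions)
    blocks = [[] for _ in range(n_blocks)]
    for set_index in range(n_sets):
        for block_index in range(n_blocks):
            # Shift the condition by one position for each successive block.
            condition = conditions[(set_index + block_index) % n_blocks]
            blocks[block_index].append((set_index, condition))
    return blocks


blocks = assign_blocks(N_SETS, CONDITIONS)
for block in blocks:
    counts = {c: sum(1 for _, cond in block if cond == c) for c in CONDITIONS}
    print(len(block), counts)
```

Because the rotation index depends only on the set and the block, every version of every set is used exactly once across the four blocks.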
Participants first gave their consent and then received general instructions about the task. In all cases, the instructions were presented together with several examples and explanations. Each participant completed only one block and responded to either 120 sentences (meaningfulness ratings) or 90 sentences (cloze probability tests, familiarity, and metaphoricity ratings). All ratings used 7-point Likert-type scales (meaningfulness ratings: 1 = totally meaningless, 7 = totally meaningful; familiarity ratings: 1 = very rarely, 7 = very frequently; metaphoricity ratings: 1 = very literal, 7 = very metaphorical). In cloze probability tests, raters were given the beginning of a sentence and asked to write the first noun (the critical word) that came to mind such that the whole sentence would be meaningful and syntactically correct. The order of stimulus presentation within each survey was randomized and counterbalanced across participants, with an equal number of stimuli per condition presented to all participants.
For the normative studies with rating scales (stimulus meaningfulness, familiarity, and metaphoricity), analyses of variance (ANOVAs) were conducted; their results are reported below. Significance values for pairwise comparisons were adjusted using the Bonferroni correction. If Mauchly’s tests indicated that the assumption of sphericity was violated, the Greenhouse–Geisser correction was applied. In such cases, the original degrees of freedom are reported with the corrected p value.
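The two corrections named above can be illustrated with a minimal Python sketch. It assumes a complete subjects-by-conditions table of scores; the function names and the toy data are hypothetical and do not come from the study's analysis, which was presumably run in a statistics package. The epsilon estimate uses the standard double-centered covariance formula, and the Bonferroni step multiplies each p value by the number of comparisons.

```python
def greenhouse_geisser_epsilon(scores):
    """Estimate epsilon from a subjects x conditions table of scores."""
    n = len(scores)
    k = len(scores[0])
    means = [sum(row[j] for row in scores) / n for j in range(k)]
    # Sample covariance matrix of the k conditions.
    cov = [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in scores) / (n - 1)
            for j in range(k)] for i in range(k)]
    row_means = [sum(cov[i]) / k for i in range(k)]
    grand = sum(row_means) / k
    # Double-center the (symmetric) covariance matrix.
    centered = [[cov[i][j] - row_means[i] - row_means[j] + grand
                 for j in range(k)] for i in range(k)]
    trace = sum(centered[i][i] for i in range(k))
    sum_sq = sum(v * v for row in centered for v in row)
    return trace ** 2 / ((k - 1) * sum_sq)


def corrected_df(df_effect, df_error, epsilon):
    """Scale both degrees of freedom by epsilon before evaluating p."""
    return epsilon * df_effect, epsilon * df_error


def bonferroni(p_values):
    """Multiply each p value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Toy data (hypothetical): 5 raters x 3 sentence types.
toy = [[1, 2, 4], [2, 3, 3], [3, 5, 9], [4, 4, 6], [5, 7, 12]]
print(greenhouse_geisser_epsilon(toy))
print(corrected_df(2, 196, 0.562))
print(bonferroni([0.004, 0.02, 0.4]))
```

As a usage example, with the familiarity result reported in this section, F(2, 196) and ε = .562, the corrected p value would be evaluated on roughly 1.12 and 110.15 degrees of freedom, while the text still reports the original 2 and 196.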
Results
To determine the reliability of the norming tests, intraclass correlation coefficients were calculated for all dimensions requiring a subjective rating. All measures indicated a high consistency across raters (Table 1).
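For readers who want to reproduce this kind of reliability check, the sketch below computes one common variant, ICC(3,k) (two-way mixed, consistency, average of k raters), from a sentences-by-raters table. The choice of this variant is an assumption: the source does not say which ICC form was used, and the function name and toy ratings are hypothetical. Since each participant completed only one block, such a coefficient would presumably be computed within a block, where all raters judged the same sentences.

```python
def icc_consistency_average(ratings):
    """ICC(3,k) from a complete table: one row per sentence, one column per rater."""
    n = len(ratings)        # number of sentences (targets)
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Two-way ANOVA decomposition without replication.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / ms_rows


# Toy ratings (hypothetical): 4 sentences rated by 3 raters on a 1-7 scale.
toy_ratings = [[2, 3, 2], [5, 6, 6], [1, 1, 2], [7, 6, 7]]
print(icc_consistency_average(toy_ratings))
```

A consistency-type coefficient ignores constant differences between raters (one rater being stricter overall), which fits the claim of consistency across raters rather than absolute agreement.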
Table 1 Study 1: Intraclass correlation coefficients for the rating tasks
Cloze probability tests
Cloze probability tests were carried out to ensure that none of the critical (sentence-final) words was predictable from the preceding context. Table 2 summarizes the results of the cloze probability tests (reported as the percentage of participants who completed the presented sentence with the critical word) together with familiarity ratings and the correlation between the two variables.
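The percentage measure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name is hypothetical, the example completions are invented rather than taken from the data, and matching responses to the critical word by exact form (ignoring case and surrounding whitespace) is an assumed scoring rule that the source does not specify.

```python
def cloze_probability(responses, critical_word):
    """Percentage of responses that match the critical word."""
    if not responses:
        raise ValueError("no responses")
    target = critical_word.strip().casefold()
    hits = sum(1 for r in responses if r.strip().casefold() == target)
    return 100.0 * hits / len(responses)


# Invented completions for the frame "Blizny to ..." (not real data).
responses = ["pamiątka", "Pamiętnik ", "historia", "ślad", "wspomnienie"]
print(cloze_probability(responses, "pamiętnik"))  # 20.0
```

A low value for a sentence frame indicates that few participants produced the critical word, that is, the word was not expected from the context.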
Table 2 Study 1: Cloze probability and familiarity results, along with the correlation between the two variables, for novel nominal metaphors, novel similes, and literal sentences
Meaningfulness ratings
To evaluate the meaningfulness of the sentences, raters assessed them on a scale from 1 (totally meaningless) to 7 (totally meaningful). The analysis showed a main effect of utterance type, F(3, 384) = 906.25, p < .001, ε = .774, ηp² = .876. Pairwise comparisons further revealed that literal sentences (M = 5.66, SE = .07) were rated as more meaningful than novel similes (M = 4.42, SE = .08), p < .001, novel similes were rated as more meaningful than novel nominal metaphors (M = 3.87, SE = .08), p < .001, and novel nominal metaphors were rated as more meaningful than anomalous utterances (M = 1.80, SE = .06), p < .001.
Familiarity ratings
To examine the familiarity of the stimuli, raters indicated how often they encountered the presented novel nominal metaphors, novel similes, and literal sentences on a scale from 1 (very rarely) to 7 (very frequently). The results revealed a main effect of sentence type, F(2, 196) = 45.94, p < .001, ε = .562, ηp² = .319. Pairwise comparisons confirmed that novel nominal metaphors (M = 1.80, SE = .09) were less familiar than both novel similes (M = 1.88, SE = .09), p = .014, and literal sentences (M = 2.51, SE = .13), p < .001. Furthermore, novel similes were less familiar than literal utterances, p < .001.
Metaphoricity ratings
To assess the metaphoricity of the stimuli, raters indicated how metaphorical or literal the novel nominal metaphors, novel similes, and literal sentences were on a scale from 1 (very literal) to 7 (very metaphorical). The analysis showed a main effect of sentence type, F(2, 198) = 902.18, p < .001, ε = .658, ηp² = .901. Pairwise comparisons further showed that novel similes (M = 5.73, SE = .08) were rated as more metaphorical than both novel nominal metaphors (M = 5.53, SE = .08), p = .001, and literal sentences (M = 1.86, SE = .07), p < .001. Additionally, novel nominal metaphors were rated as more metaphorical than literal utterances, p < .001. Figure 2 presents meaningfulness, metaphoricity, and familiarity ratings for the materials.
The overall means obtained from the normative tests confirmed the desired differences between the conditions. Table 3 provides interscale correlations between stimulus dimensions, and Table 4 summarizes stimulus characteristics by sentence type.
Table 3 Study 1: Interscale correlations between stimuli dimensions
Table 4 Study 1: Summary of stimuli characteristics by sentence type