The use of controlled stimuli is an essential component of the scientific process, so it is important to ensure that stimuli have been appropriately normed for the population and variables being tested. Oftentimes researchers will use a shared database of normed stimuli to ensure consistency across projects and laboratories. One such collection of normed stimuli is the set of literary and nonliterary metaphors generated by Katz, Paivio, Marschark, and Clark (1988). Katz et al. collected ratings on ten dimensions that could be used to describe metaphors: their comprehensibility, ease of interpretation, metaphoricity, metaphor goodness, imagery of the metaphor, imagery of the subject, imagery of the predicate, familiarity, semantic relatedness, and number of alternative interpretations. Because of the span of dimensions measured, this collection has been used in many studies since its publication (e.g., Bowdle & Gentner, 2005; Campbell & Raney, 2013; De Grauwe, Swain, Holcomb, Ditman, & Kuperberg, 2010; Diaz, Barrett, & Hogstrom, 2011; Diaz & Hogstrom, 2011; Gentner & Wolff, 1997; Kacinik & Chiarello, 2007; Kuiken, Chudleigh, & Racher, 2010; Schmidt & Seger, 2009; Thibodeau & Durgin, 2011; Xu, 2010). The purpose of the present research was to replicate a portion of Katz et al.’s metaphor norms to determine whether their normative data are still valid.

There are several reasons to replicate Katz et al.’s (1988) norms. One important reason is that language is not static, and interpretations of various figurative tropes could change over time. Therefore, it is reasonable to ask whether the metaphors used by Katz et al. are perceived and understood in the same way today as they were 25 years ago. For instance, the familiarity of the metaphors may have changed, or the conventionality of the base may have shifted over time (Bowdle & Gentner, 2005)—where the base refers to the final word of a metaphor (the base is also commonly referred to as the vehicle). For example, many conventional words, such as gold mine or blockbuster, are entrenched in our lexicon as figurative phrases, largely ignoring their literal origin.

Another reason to replicate Katz et al.’s (1988) norms is that different groups of people might respond to metaphors differently. As is typical of much research in psychology, Katz et al. used undergraduate students as participants, but note that the research was performed at a university in Canada. Although we have no reason to expect that Canadian students would process metaphors much differently than, say, students in the United States, there is evidence that interpretations of metaphors may differ between cultures or geographic regions (Boers, 2003; Littlemore & Low, 2006). For instance, when the word ski is presented in an ambiguous context (e.g., I want to go skiing) to people who live in the state of Florida, their initial thought might be of water skiing, whereas people who live in Wyoming might initially think of snow skiing. Consequently, it is important to determine whether the normative ratings would replicate using a different population. Beyond knowing that Katz et al.’s participants were college students in introductory psychology, little else is known about them. As Katz et al. pointed out, differences between individuals may impact the characteristics of the metaphor that are readily available to the reader. What sorts of factors may elicit these differences in ratings? One potential factor is the language background of the sample population used. For instance, whether normative ratings are collected using predominantly monolingual or bilingual participants might be important, because monolinguals and bilinguals have different linguistic and perhaps cultural experiences (Bortfeld, 2002; Colston, 2005).

Because the database created by Katz et al. (1988) has been and continues to be used extensively in research on metaphor processing, we believe that the norming data should be compared to data from a new generation of research participants, to examine the modern validity of the original norms. What follows is a renorming of a sample of the nonliterary metaphors included in Katz et al.’s study.

Methodology

Participants

Ninety undergraduate students from the University of Illinois at Chicago (UIC) participated in exchange for credit toward their introductory psychology course (see Footnote 1). UIC’s student population represents one of the most diverse campuses in the United States, in that it is a minority-majority campus. This means that no one racial group comprises at least half of the total student population. UIC is also linguistically diverse. Approximately 52 % of UIC students self-report being multilingual, and another 29 % report being “somewhat” multilingual (i.e., not proficient in their second language). Furthermore, approximately 40 % of UIC students indicate that English is not their native language or that they have two native languages (i.e., they learned two languages from birth or from shortly after birth), but the vast majority of students had attended English-speaking schools prior to college, making them highly proficient English speakers.

Because of the diverse language background of the students at UIC, participants were required to have attended English-speaking schools for at least 10 years. This restriction was used to ensure that they had substantial knowledge of the English language. Of the 90 students tested, 60 (66 %) self-reported being proficient bilinguals, and 60 (66 %) were native English speakers; these groups overlap, in that 30 participants were both native English speakers and proficient bilinguals. This information was gathered through self-report language history questionnaires. Descriptive information about the participants is provided in Table 1.

Table 1 Average scores (and standard deviations in parentheses) on the vocabulary test (maximum = 30) and self-reported proficiency ratings (maximum = 10) for speaking, understanding, and reading English and for participants’ most proficient language other than English

Materials and apparatus

Fifty nonliterary metaphors were selected from the Katz et al. (1988) collection. Because this collection is oftentimes used for research exploring familiarity effects on metaphor processing, the metaphors were selected from a wide range of Katz et al.’s familiarity ratings. Specifically, 20 of the most familiar metaphors, 20 of the most unfamiliar metaphors, and 10 metaphors near the median familiarity score were selected. Additionally, given that syntactic structure can have a substantial impact on comprehension (Gentner & Wolff, 1997; Glucksberg, 2008), all of the selected metaphors followed the “X is a Y” format. For example, the metaphor Love is a flower follows this format, whereas Thunder clouds are draperies pulled across the sun does not. A full list of the metaphors used can be found in Appendix B.

The 50 selected metaphors were printed in packets with the metaphors in a predetermined, randomized order. Each metaphor was presented alone (i.e., with no context) and was followed by the ten norming questions Katz et al. (1988) had used to evaluate (1) comprehensibility, (2) ease of interpretation, (3) metaphoricity, (4) metaphor goodness, (5) metaphor imagery, (6) subject imagery, (7) predicate imagery, (8) felt familiarity, (9) semantic relatedness, and (10) number of interpretations. See Appendix A for a brief description of each of these dimensions from the instruction page of the norming packet, and Katz et al.’s report for a full explanation. The descriptions used for our norming packet were taken from Katz et al., and each domain was rated on a seven-point scale, with each scale being explained to the participants before they began the study.

As part of the metaphor norming packet, participants were given a language history questionnaire and an English vocabulary test (developed by Raney). The language history questionnaire allowed us to collect self-report information on the participants’ native languages, the number of languages they speak, and their relative proficiency in each language. The vocabulary test had been used in a number of prior studies (Minkoff & Raney, 2000; Therriault & Raney, 2007) and is moderately correlated with English reading comprehension ability (rs = .40 to .52 in prior studies). The test consisted of 30 words presented in isolation, and the participants were asked to select the best meaning from among five alternatives. The vocabulary test was designed to be relatively difficult, with the average score being approximately 14–15 correct for a population of predominantly freshman college students.

Procedure

Participants were provided with a metaphor norming packet containing instructions, the metaphors, the language history questionnaire, and the vocabulary quiz, in that order. Each of the norming dimensions was described to the participants before they began rating the metaphors. This was particularly important for clarifying the subject and predicate imagery dimensions (the subject is often called the target, and the predicate is often called the vehicle or base, in metaphor research). Participants were instructed to complete the norming packet before completing the language history questionnaire and vocabulary test. After completing all three forms, the participants were debriefed and dismissed.

Results

We present the results in two sections. The first section reports overall comparisons between the Katz et al. (1988) and UIC ratings. Interscale correlations between the ten dimensions are then provided for the UIC data (see Katz et al., 1988, for their interscale correlations). The second section reports comparisons of the Katz et al. ratings and the UIC ratings when the UIC participants were divided into subgroups on the basis of their vocabulary scores and language background.

Overall comparisons

Ratings and correlations

Average ratings for each of the ten norming dimensions were collected for each metaphor and then correlated with the ratings collected by Katz et al. (1988), using Pearson correlations. Table 2 presents the average ratings for the UIC and Katz et al. participants. The ratings of the selected metaphors have remained highly consistent over time, as is indicated by the presence of significant positive correlations between the UIC and Katz et al. ratings for all ten dimensions. The correlations range from .56 to .78, with six of the ten correlations exceeding .70. Of particular importance is the high correlation for the familiarity dimension (r = .78), which indicates that the metaphors that were rated as familiar 25 years ago are still considered familiar today, and the metaphors that were rated as unfamiliar have remained relatively unfamiliar.
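The dimension-level comparisons described above reduce to computing a Pearson correlation between two sets of per-metaphor average ratings. The sketch below illustrates the calculation; the rating values and variable names are hypothetical, not the actual norming data.

```python
# Illustrative sketch: Pearson correlation between per-metaphor average
# ratings from two norming samples on one dimension. All values are
# hypothetical, not the actual Katz et al. or UIC data.
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical familiarity ratings for five metaphors (1-7 scale).
katz_1988 = [6.1, 2.3, 4.0, 5.5, 1.8]
uic_sample = [6.4, 2.9, 4.6, 5.9, 2.5]
r = pearson_r(katz_1988, uic_sample)
```

A high positive r of this kind indicates that metaphors rated as familiar in one sample were also rated as familiar in the other, even if the absolute ratings shifted.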

Table 2 Average ratings (maximum = 7, with standard deviations in parentheses) for each metaphor dimension, correlations between the Katz et al. (1988) and UIC ratings, and difference scores between the Katz et al. and UIC ratings

The absolute ratings have increased slightly over time for some of the dimensions. The important finding, however, is that the relative ratings have remained stable, as is indicated by the correlations between the Katz et al. (1988) and UIC ratings. Researchers typically use normative ratings to access metaphors that are relatively high or low in familiarity, such as the top or bottom quartiles. The absolute values of the ratings are secondary to the relative ratings.

Difference scores

Although the relative ratings across dimensions have remained consistent, Table 2 shows that the magnitudes of the average ratings have shifted slightly over time. Specifically, the average ratings for eight of the ten dimensions are slightly larger for the UIC population than for the Katz et al. (1988) population. Ratings for two of the dimensions (metaphoricity and metaphor imagery) have increased by over one point (one point equals approximately a 14 % change). To evaluate the change in ratings, independent-samples t tests between the average UIC ratings and the Katz et al. ratings were run for each dimension (see Table 3). These tests were based on the average rating for each dimension for each participant (i.e., one score per participant per dimension). With the exception of comprehensibility, ease of interpretation, and semantic relatedness (ts < 2.0, n.s.), the ratings from UIC were significantly higher than the Katz et al. norms (all ps < .05).
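The independent-samples t tests described above, based on one average rating per participant per dimension, can be sketched as follows using a pooled-variance (Student's) t statistic; the per-participant averages below are illustrative, not the actual data.

```python
# Minimal sketch of an independent-samples (pooled-variance) t test.
# Group values are hypothetical per-participant average ratings (1-7 scale).
from math import sqrt

def independent_t(g1, g2):
    """Return the pooled-variance t statistic and degrees of freedom."""
    n1, n2 = len(g1), len(g2)
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    ss1 = sum((x - m1) ** 2 for x in g1)
    ss2 = sum((x - m2) ** 2 for x in g2)
    df = n1 + n2 - 2
    pooled_var = (ss1 + ss2) / df
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df

group_a = [5.1, 4.8, 5.6, 5.3, 4.9, 5.4]  # hypothetical newer sample
group_b = [4.2, 4.6, 4.1, 4.4, 4.0, 4.5]  # hypothetical original sample
t, df = independent_t(group_a, group_b)
```

The resulting t is then compared against the critical value for the obtained degrees of freedom to decide significance.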

Table 3 Independent-samples t tests comparing average UIC ratings and the average ratings collected by Katz et al. (1988) for each norming dimension

Interscale correlations

Interscale correlations were computed among the ten scales to determine how strongly they are related to one another, as had been done by Katz et al. (1988). Table 4 shows that all of the dimensions are strongly correlated with one another. It is worth noting that the interscale correlations found with the UIC population are larger than those collected by Katz et al. The average interscale correlation reported by Katz et al. was .76, whereas the UIC average is .94.

Table 4 Interscale correlations between metaphor dimensions

There are two potential reasons for the higher interscale correlations for the UIC data. The first reason is that the UIC data are based only on nonliterary metaphors in the “X is a Y” format, in which X and Y are single words. Thus, the type of metaphors evaluated for the UIC norms was more restricted than the types of metaphors included in Katz et al.’s (1988) norms. To evaluate this possibility, we recalculated the interscale correlations for Katz et al.’s norms on the basis of the 50 metaphors included in the present study. The average interscale correlation increased from .76 (Katz et al.’s, 1988, original average) to .81. Restricting the metaphors led to a small increase in the average interscale correlation, but certainly not to a point equal to our average interscale correlation. This makes sense, given that the 50 ratings were taken from a database in which participants rated the full set of 260 nonliterary metaphors; thus, calculating interscale correlations on the subset of 50 items does not minimize the range of metaphors actually rated by Katz et al.’s participants.

A second potential reason for the difference in the interscale correlations is that Katz et al. (1988) had separate groups of participants rate all of the metaphors on a single dimension, rather than having a single group of participants rate the metaphors on all ten dimensions (as we did). To evaluate whether this methodological change influenced the size of the interscale correlations, we conducted a follow-up experiment in which participants rated the metaphors on a single dimension. We included three dimensions (comprehensibility, familiarity, and number of alternative interpretations). The follow-up study is reported fully in Appendix C. The key finding from the follow-up study is that changing the procedure did not substantially influence the results. The average correlations between the UIC data and Katz et al.’s data were .67, .82, and .62 for comprehensibility, familiarity, and number of alternative interpretations, respectively. The average interscale correlation between these three dimensions was approximately .9. Thus, for these dimensions, asking participants to rate all of the dimensions or to rate a single dimension did not change the pattern of results. This does not explain why the interscale correlations are high for the UIC data, but it does eliminate the methodological explanation.

Individual differences

The UIC metaphor ratings were reevaluated to determine whether individual differences in English vocabulary knowledge and language history influenced the ratings. Specifically, participants were divided into low- and high-vocabulary groups and into three language groups, based on whether they were native or nonnative English speakers and whether the native English speakers were proficient or nonproficient bilinguals.

Vocabulary

The participants were divided into low and high vocabulary knowledge groups based on a median split of their vocabulary scores. Low-vocabulary participants scored 13 points or lower out of a possible 30 on the vocabulary test, with the average score being 11.2. The average score for the high-vocabulary group was 17.8. A t test showed that the difference in vocabulary scores between the groups was significant, t(88) = 10.6, p < .01.
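The median-split grouping used here can be sketched as follows; the vocabulary scores below are illustrative, not the actual UIC data.

```python
# Hypothetical sketch of a median split on vocabulary scores.
# The scores are illustrative, not the actual participant data.
from statistics import median

def median_split(scores):
    """Split scores into low (at or below the median) and high groups."""
    cut = median(scores)
    low = [s for s in scores if s <= cut]
    high = [s for s in scores if s > cut]
    return low, high

vocab_scores = [9, 11, 12, 13, 13, 15, 16, 18, 20, 22]  # out of 30
low_group, high_group = median_split(vocab_scores)
```

One design note: with ties at the median, assigning at-median scores to the low group (as here) can leave the groups slightly unequal in size; other conventions are possible.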

The average ratings for both vocabulary groups for each of the ten norming dimensions, as well as the correlations with the Katz et al. (1988) ratings, can be found in Table 5. For both the low- and high-vocabulary groups, we found significant correlations between the UIC ratings and Katz et al.’s ratings. Examination of the ratings in Table 5 gives the impression that the average ratings for most dimensions are slightly higher for the low-vocabulary than for the high-vocabulary group. To evaluate this possibility, we compared the average ratings, collapsed across the ten dimensions, for the low- and high-vocabulary participants (i.e., one data point per participant to represent the average rating across all ten dimensions). Average ratings were not significantly different between the vocabulary groups [t(998) = 0.14, n.s.].

Table 5 Average ratings (maximum = 7, with standard deviations in parentheses) for low- and high-vocabulary UIC students, and correlations between the Katz et al. (1988) and UIC ratings based on vocabulary score

Most importantly, the correlations between each vocabulary group and the Katz et al. (1988) ratings were large and statistically significant for every dimension. Participants in the low-vocabulary group appear to have smaller correlations than the high-vocabulary group for nearly every dimension, but the average correlation across the ten dimensions was not statistically lower for the low-vocabulary group, t(18) = –1.6, n.s. In essence, vocabulary knowledge did not significantly affect the magnitude of the UIC ratings or the correlations with the Katz et al. ratings.

Language history

Participants were split into three groups based on their responses on the language history questionnaire. Group 1 consisted of native English speakers who were nonproficient bilinguals (n = 30). Most of these individuals had some experience with a second language, usually from learning it in a classroom, but did not consider themselves proficient in speaking, understanding, or reading the language. Only four individuals in Group 1 considered themselves purely monolingual, with no experience learning a foreign language. Group 2 consisted of native English speakers who were proficient bilinguals (n = 30). These individuals reported themselves as having first learned English, but they were also capable of proficiently speaking a second language (usually learned at an early age). The second language was often learned in a school setting as well as at home. Group 3 consisted of nonnative English speakers who were proficient bilinguals (n = 30). The individuals were proficient in English and in another language that had been learned prior to acquiring English, which was typically their native language. The second language was often learned in a school setting as well as at home.

Table 6 presents the average ratings for the three UIC language groups, as well as the correlations between the ratings from each language group and the Katz et al. (1988) participants. Across all ten dimensions, the ratings for each language group were significantly correlated with the Katz et al. ratings. A one-way between-subjects analysis of variance (ANOVA) was performed on the average ratings for the three language groups (with the average rating across all ten dimensions for each participant being entered as a data point) to determine whether the average ratings differed across language groups. The average ratings across the ten dimensions were not significantly different between the three language groups, F(2, 1497) = 1.7, n.s.
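The one-way between-subjects ANOVA used above can be sketched as the ratio of between-group to within-group mean squares; the group values below are hypothetical per-participant average ratings, not the actual data.

```python
# Minimal sketch of a one-way between-subjects ANOVA F statistic.
# Group values are hypothetical per-participant average ratings (1-7 scale).
def one_way_anova(groups):
    """Return F and (df_between, df_within) for a list of groups."""
    all_vals = [x for g in groups for x in g]
    n, k = len(all_vals), len(groups)
    grand_mean = sum(all_vals) / n
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    ss_within = sum(
        sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups
    )
    df_b, df_w = k - 1, n - k
    f = (ss_between / df_b) / (ss_within / df_w)
    return f, (df_b, df_w)

g1 = [4.9, 5.1, 5.0, 5.2]  # e.g., hypothetical Group 1 averages
g2 = [4.8, 5.0, 5.1, 4.9]
g3 = [5.0, 5.2, 4.9, 5.1]
f, (df_b, df_w) = one_way_anova([g1, g2, g3])
```

An F near 1, as with these illustrative values, is consistent with no group difference; a large F would indicate that between-group variability exceeds what within-group noise predicts.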

Table 6 Average ratings (maximum = 7, with standard deviations in parentheses) for UIC students as a function of their language background, and correlations between the Katz et al. (1988) and UIC ratings based on the language background of the UIC students

The average correlations between each language group and the Katz et al. (1988) data were also compared using a one-way, between-subjects ANOVA (based on one overall average correlation per participant). The average correlation was significantly different between groups, F(2, 27) = 3.4, p < .05. Post hoc comparisons showed that Group 1 (M = .72) had a statistically larger average correlation with the Katz et al. data than did Group 3 (M = .63). The sizes of the correlations with the Katz et al. data did not differ between Groups 2 (M = .66) and 3 (M = .63).

General discussion

Our findings support the conclusion that the Katz et al. (1988) norms for nonliterary metaphors of the “X is a Y” format remain highly valid. We found significant correlations between the UIC and Katz et al. ratings for all ten dimensions. Although the magnitudes of the ratings for our participants were slightly larger for some of the dimensions, relative familiarity, meaningfulness, and so forth have remained highly consistent for the metaphors normed here. That is, metaphors that were rated as more familiar 25 years ago were still rated as more familiar now, and metaphors that were rated as less familiar 25 years ago were still rated as less familiar now. The same conclusion holds for the other dimensions. In general, the metaphors in the UIC sample are as comprehensible and as easy or difficult to interpret now as when they were originally normed. Likewise, individuals are generally able to create the same number of interpretations as before, the degree of imagery invoked by the metaphors and their components has remained consistent, and so forth. As such, we can be confident in past and future research that is based on this metaphor collection. The consistency of the familiarity ratings is especially important, because metaphor familiarity is a central component of several models of metaphor processing and has been the focus of much research (Blasko & Briihl, 1997; Blasko & Connine, 1993; Bowdle & Gentner, 2005; Campbell & Raney, 2013; Gentner & Wolff, 1997).

One might wonder why the ratings were slightly higher for some of the dimensions in the UIC data than in the Katz et al. (1988) data. One possible explanation for the higher metaphoricity and familiarity ratings is that metaphors may be encountered more frequently today than 25 years ago. For instance, with increased interactions through brief e-mails and text messaging, people may more frequently use figurative language to quickly express themselves. It is also possible that students might rate any form of language as being more familiar today, and the higher ratings may have nothing to do with knowledge of metaphors per se. These explanations are purely speculative and warrant further study. The important point is that the relative ratings have remained stable over time, as reflected by the strong correlations between the UIC and Katz et al. data. This allows us to be confident in past and current research based on the metaphors in the Katz et al. norms.

The present data also have implications for research based on diverse populations. Large and statistically reliable correlations were found between the UIC ratings and the Katz et al. (1988) ratings, based both on all participants and on the participants divided into groups based on vocabulary knowledge and language background. Vocabulary knowledge, which is moderately correlated with reading comprehension ability, did not affect the magnitude of the ratings or the size and pattern of the correlations between the UIC and Katz et al. ratings. This implies that the norms are valid for college students of all vocabulary levels, as long as they are proficient speakers of English (defined in the present study as having at least 10 years of education in a setting in which English has been spoken).

Our results also demonstrate that the Katz et al. (1988) database is valid for both native and nonnative English speakers, as long as they are proficient speakers of English. We found large and statistically reliable correlations between the UIC and Katz et al. ratings for native English speakers who were not proficient bilinguals (Group 1), for native English speakers who were proficient bilinguals (Group 2), and for nonnative English speakers who were proficient bilinguals (Group 3). Interestingly, Group 1 consistently had the largest correlations with the Katz et al. data. The larger correlations might be due to the fact that these individuals speak English almost exclusively, and therefore they might have more experience with the selected metaphors. The two bilingual language groups (2 and 3) produced similar ratings and correlations across all ten dimensions. This might be due to the fact that these individuals regularly use multiple languages, and their levels of exposure to the selected metaphors are therefore similar. Future research could examine these speculative explanations regarding exposure to metaphors.

One general implication of our findings is that the influence of linguistic background on performance is complex. Linguistic background had little effect on the relative ratings of metaphors, such as their relative familiarity, or on the overall magnitude of the ratings. In contrast, it did reliably affect how closely each group’s ratings correlated with the original norms, with native English speakers who were not proficient bilinguals showing the strongest correlations with the Katz et al. (1988) data. How language background influences the comprehension of metaphors remains an important topic for future research.

Some potential limitations need to be considered regarding the present study. First, we used a methodology modified from the one used by Katz et al. (1988). As we mentioned earlier, Katz et al. had separate groups of participants rate the metaphors on a single dimension, whereas we had a single group of participants rate the metaphors on all ten dimensions. Our follow-up experiment (see Appendix C) indicated that the same patterns of results were found using each methodology; therefore, we are confident that the relative ratings remain consistent no matter which methodology is employed. A second potential limitation is that we used a subset of the 260 nonliterary metaphors normed by Katz et al.—specifically, metaphors having the “X is a Y” format, in which X and Y are single words. This could reduce the variability in the ratings relative to rating metaphors of mixed formats. Future studies could evaluate this possibility.

In summary, our findings support the conclusion that Katz et al.’s (1988) normative ratings of nonliterary metaphors remain valid as long as the research participants are proficient English speakers. Whether a participant has a low or high vocabulary or is a native or nonnative English speaker does not impact the pattern of ratings for this collection of metaphors.