The construct of imageability refers to how easily a word elicits a mental picture of its referent (Toglia & Battig, 1978). In memory and word processing studies, imageability can be manipulated (Binder, 2007; Davis, 2010; Strain & Herdman, 1999; Strain, Patterson, & Seidenberg, 1995, 2002), analyzed as a predictor variable in multiple regression analyses (Balota, Cortese, Sergent-Marshall, Spieler, & Yap, 2004; Cortese, Khanna, & Hacker, 2010), or controlled for (Cortese, 1998). However, to date, the availability of imageability information for multisyllabic words has been scant at best, so the use of imageability as a variable has been limited. For example, in their regression analyses of reading aloud and lexical decision performance for 6,115 multisyllabic words, Yap and Balota (2009) did not assess imageability because estimates were not available for the vast majority of the stimuli. The present work provides a set of imageability ratings and reaction times for 3,000 disyllabic words, allowing work on imageability to be extended to the study of multisyllabic words.

It is important to examine the influence of imageability on performance because of its theoretical implications. For example, Paivio (1971) has proposed that words are encoded on the basis of information obtained from the senses, and that this information can be classified as verbal and/or nonverbal. Imagery is thought to be an important aspect of the nonverbal code, and the images elicited by concepts are believed to serve as mediators in associations. In support of this dual-code hypothesis, highly imageable words are often better remembered than low-imageable words (Cortese et al., 2010; Rubin & Friendly, 1986).

In addition, imageability is theoretically important for computational models of word recognition. The theoretical framework of Seidenberg and McClelland (1989) proposes that fully interactive connections between orthographic, phonological, and semantic codes allow for the top-down influence of semantics on the computation of the orthographic and/or phonological code(s) in visual word recognition. In their computational model of reading aloud, Plaut, McClelland, Seidenberg, and Patterson (1996) demonstrated that damage to the semantic pathway can account for surface dyslexia. This supports the assumption that semantic information is involved in the visual word recognition process, and thus, semantic-level variables, such as imageability, should account for variance in visual word recognition performance. Consistent with this idea, Strain, Patterson, and Seidenberg (1995) found that the consistency effect (i.e., longer naming reaction times for words with inconsistent spelling-to-sound mappings, such as pint) was practically eliminated for high-imageable words (e.g., comb vs. cliff). However, this finding has been controversial, and Monaghan and Ellis (2002) as well as Ellis and Monaghan (2002) suggested that age of acquisition (AoA) is actually the variable that interacts with frequency and consistency (see also Strain & Herdman, 1999; Strain et al., 2002). In their analyses of 2,428 monosyllabic words, Balota et al. (2004) reported that imageability accounted for unique variance in reading aloud. However, in subsequent work in which both AoA and imageability were assessed simultaneously, Cortese and Khanna (2007) found that only AoA accounted for unique variance in reading-aloud reaction time performance.

Furthermore, it has been hypothesized that the typical lexical decision task emphasizes semantic processes because meaning differentiates words from nonwords (i.e., words have meaning but nonwords do not; see, e.g., Chumbley & Balota, 1984). In support of this idea, high-imageable words often lead to faster lexical decisions than do low-imageable words (Balota et al., 2004; Cortese & Khanna, 2007). Interestingly, James (1975) found that concreteness, a variable that is highly correlated with imageability, was related to lexical decision reaction times when pronounceable nonwords (e.g., blark) served as distractors, but not when nonwords that violated orthographic constraints served as distractors. This finding suggests that the relative influence of concreteness (and hence imageability) on lexical decision may depend on the level of lexical processing (e.g., semantic vs. orthographic-to-phonological mapping) required by the experimental context. The effect of imageability on reading-aloud performance also may depend on the list context (see, e.g., Zevin & Balota, 2000).

Recently, brain-imaging studies have indicated that highly imageable words produce different patterns of neural activity than do less imageable words during word-processing tasks (see, e.g., Graves et al., 2010). In addition, Bedny and Thompson-Schill (2006) found that activity in the left superior parietal lobule, right posterior middle temporal gyrus, left superior frontal gyrus, and left fusiform gyrus increased as imageability increased. Klaver et al. (2005) found that highly imageable words activated the P600 wave in the hippocampus more than words of low imageability did during a recognition memory task. These neuroimaging studies suggest that imageability is an influential property of words, and they encourage further study. However, these imaging studies require a large number of stimuli from each condition. Thus, in order to conduct these studies of imageability’s influence on word processing, a large number of imageability estimates must be available. Graves et al. (2010) utilized six different sources when selecting their stimuli, including the estimates for 3,000 monosyllabic words collected by Cortese and Fugett (2004).

Finally, it is important to note that while the study of monosyllabic words has dominated the word recognition literature for many years, there has been a recent emphasis on the processing of multisyllabic words. Analyses of reading-aloud and lexical-decision performance for tens of thousands of words are now possible due to the completion of the English Lexicon Project (i.e., ELP; Balota et al., 2007). The ELP provides reaction time estimates in the reading-aloud and lexical-decision tasks for over 40,000 English words. While Cortese and Fugett (2004) provided imageability estimates for 3,000 monosyllabic words, imageability estimates are unavailable for most of the multisyllabic words in the ELP corpus. We note that in their analyses of the reaction time and accuracy rates of monosyllabic words, Balota et al. (2004) utilized the Cortese and Fugett estimates; however, the recent analyses on performance measures of multisyllabic words by Yap and Balota (2009) did not include imageability because estimates were not available for a large number of multisyllabic words.

The present study provides estimates for 3,000 disyllabic words that participants rated across two sessions. The procedure was very similar to that employed by Cortese and Fugett (2004). We expect that these estimates will be useful to researchers who are interested in analyzing performance from the ELP as well as those who wish to employ a large number of disyllabic words in their behavioral and neuroimaging studies of word processing.

Method

Participants

A group of 35 students enrolled in undergraduate psychology courses at the University of Nebraska Omaha (n = 26) and Creighton University (n = 9) participated for extra credit or course credit. Of the 11 participants for whom demographic information was available, 36% were male and 64% were female. They ranged from 20 to 30 years of age (M = 23.82, SD = 3.28) and had completed 14–18 years of education (M = 16.27, SD = 1.11). A total of 82% were Caucasian, 9% were Asian, and 9% were of other ethnicities.

Stimuli

The stimuli were 3,000 disyllabic words. Item characteristics for the 3,000-word corpus are provided in Table 1. In addition, we computed correlations among imageability, frequency, length, and AoA (based on preliminary data collected on these stimuli in our lab). These correlations are presented in Table 2. The items were mainly monomorphemic, but very common multimorphemic words (e.g., breakfast) were also included. We started with 23,365 disyllabic words, with the goal of narrowing the list down to 3,000. Our previous research had indicated that, across a number of different tasks, participants could respond to about 3,000 words in less than 4 h (or two 2-h sessions). To achieve this goal, a group of 7 undergraduate research assistants inspected the large sample of disyllabic words and indicated the words that they knew; each research assistant perused a different sample of words. This reduced the sample to 15,434 words. From this reduced sample, we eliminated many multimorphemic words but retained some common ones. We included a relatively large proportion of monomorphemic words because Yap and Balota’s (2009) studies on multisyllabic word processing utilized only monomorphemic words. As previously noted, Yap and Balota did not assess imageability because ratings were not available for most of these words. Thus, imageability estimates for a large sample of multisyllabic monomorphemic words would allow researchers to extend the work of Yap and Balota utilizing a large common set of stimuli. Although we ended up with 3,000 words, more words could easily have been included. Thus, the final sample of words was a representative but not an exhaustive sample of disyllabic words. In other words, there was no a priori reason for selecting these specific 3,000 words. Instead, we wanted a set of words that exhibited variability on a range of other factors.
Our long-term goal was to collect normative data for as many English words as possible on key variables, such as imageability, AoA, and so forth. For example, the stimuli ranged in frequency from 0 to 71.21 occurrences per million (Brysbaert & New, 2009), and they ranged in length from 3 to 11 letters. Extending the number of words for which normative data are available would allow for further examination of how these variables relate to performance on reading aloud, lexical decision, and so on, for a large set of stimuli, as well as extending the number of stimuli that could be selected for future research endeavors. For example, the English Lexicon Project (Balota et al., 2007) expanded the number of words from which researchers can select according to certain factors (e.g., average naming latencies, lexical decision times, etc.) to 40,481 words. However, most of these words have not been normed for imageability (as well as other lexical characteristics). Thus, certain practical constraints limit the items that researchers can include in their experiments, as well as limiting the factors that can be assessed in large-scale analyses (see, e.g., Yap & Balota, 2009). By providing imageability estimates for 3,000 disyllabic words, our study would reduce these constraints.

Table 1 Item characteristics for the 3,000-disyllabic-word corpus
Table 2 Correlation matrix for imageability, frequency, length, and age of acquisition (AoA)

Procedure

The procedure of Cortese and Fugett (2004) was followed as closely as possible. A microcomputer was used to collect ratings in a laboratory. Two sessions of 1.25–2 h were conducted within one week of each other. Each session comprised two blocks of 750 words each, for a total of 1,500 words. Opportunities for breaks occurred after every 375 trials. A randomized file was generated anew for each set of 4 participants, and four blocks of 750 words were constructed from this file. Each participant from this set of 4 received the same words in each of the four blocks. However, the four blocks of trials were counterbalanced across each set of 4 participants according to a Latin square design, such that each block occurred equally often in Block 1, Block 2, and so on. Within each block of trials, the presentation order of the stimuli was random.
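The block construction and counterbalancing just described can be sketched in code. The following is a minimal illustration under our own assumptions (function names and the use of a cyclic Latin square are ours), not the original experiment software:

```python
import random

def make_counterbalanced_blocks(words, n_blocks=4, block_size=750, seed=0):
    """Shuffle the word list once per set of 4 participants and split it
    into four 750-word blocks (two blocks per session)."""
    rng = random.Random(seed)
    shuffled = words[:]
    rng.shuffle(shuffled)
    return [shuffled[i * block_size:(i + 1) * block_size]
            for i in range(n_blocks)]

def latin_square_orders(n=4):
    """Row r of a cyclic Latin square gives participant r's block order,
    so each block appears equally often in each serial position."""
    return [[(r + c) % n for c in range(n)] for r in range(n)]
```

Assigning row r of the square to participant r in each set of 4 guarantees that every block occurs once in each serial position across the set, matching the counterbalancing scheme described above.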

On each trial, a word was presented in the center of the screen and ratings were entered using the numerical keypad on the right-hand side of the keyboard. The instructions were the same as those given by Cortese and Fugett (2004), except that the example of a highly imageable word was changed from the mascot of Morehead State University to maverick (the University of Nebraska–Omaha mascot) for University of Nebraska at Omaha students, and bluejay (the Creighton University mascot) for Creighton University students (see the Appendix). Two other aspects of the procedure differed from the method of Cortese and Fugett. First, we assumed that responses made in less than 500 ms were made too quickly for an accurate assessment of imageability, so responses that occurred in less than 500 ms were not accepted and were followed by the message “response too fast – slow down!” which appeared at the bottom center of the screen. The target word then reappeared after 2,000 ms to be (re)rated. The 2,000-ms delay was intended to discourage participants from responding too quickly on future trials. Second, responses that were not numbers between 1 and 7 were rejected and resulted in the message “response invalid – try again” at the bottom center of the screen for 2,000 ms. The target word then reappeared after 2,000 ms to be (re)rated. In each of these situations, the second rating (as long as it took longer than 500 ms and was a number between 1 and 7) was recorded as the estimate for that item.
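The two trial-rejection rules reduce to a short validation check. The sketch below is an illustrative reconstruction with assumed names, not the original experiment code:

```python
def validate_response(key, rt_ms):
    """Check one rating response against the two rejection rules:
    responses faster than 500 ms, and keypresses other than 1-7, are
    rejected; the word is then re-presented after a 2,000-ms message."""
    if rt_ms < 500:
        return "response too fast - slow down!"
    if key not in {"1", "2", "3", "4", "5", "6", "7"}:
        return "response invalid - try again"
    return None  # accept the rating
```

Note that the speed check applies first, so a fast invalid keypress triggers the "too fast" message on this sketch; the source does not specify the order of the two checks, so that ordering is an assumption.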

Analyses, results, and discussion

The estimates reported here were compiled from 32 of the 35 participants. After we collected data from 32 participants, we screened the data in the following ways. First, the overall mean for each item was computed by averaging across the 32 participants. Second, a correlation coefficient was computed between each participant’s ratings and the overall average across the 3,000 items. From the set of correlation coefficients, a mean correlation (.578) and standard deviation (.180) were established. We removed from further analysis the estimates of 3 participants whose correlation coefficients fell more than two standard deviations below the mean. This procedure was intended to eliminate the data of participants who either were not representative of the population of interest or were not taking the task seriously. In order to complete the counterbalancing of blocks (see the Procedure section), these estimates were replaced by estimates from 3 new participants. For the 32 participants whose data were retained, the correlations between individual ratings and the overall mean ranged from .24 to .78.
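This screening step can be expressed compactly. The sketch below uses assumed data structures (one rating list per participant, all in the same item order) and flags raters whose item-wise correlation with the grand item means falls more than two standard deviations below the mean correlation:

```python
from statistics import mean, stdev

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def screen_raters(ratings):
    """ratings: list of per-participant rating lists (same item order).
    Returns (indices of flagged participants, per-participant r's)."""
    n_items = len(ratings[0])
    item_means = [mean(p[i] for p in ratings) for i in range(n_items)]
    rs = [pearson(p, item_means) for p in ratings]
    cutoff = mean(rs) - 2 * stdev(rs)
    return [i for i, r in enumerate(rs) if r < cutoff], rs
```

One design point worth noting: because each participant's own ratings contribute to the item means, the correlations are slightly inflated, but with 32 raters per item the effect is small.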

To establish criterion-related validity, we computed correlations between our estimates and those of four other norm corpora, using the items common to each pair. Our estimates correlated highly with those of Toglia and Battig (1978; N = 535), r = .88, p < .001; Gilhooly and Logie (1980; N = 433), r = .86, p < .001; Bird, Franklin, and Howard (2001; N = 292), r = .82, p < .001; and Stadthagen-Gonzalez and Davis (2006; N = 543), r = .86, p < .001, suggesting some validity and stability of imageability across these studies and across time. Previously, Cortese and Fugett (2004) had reported a correlation of .89 between their own and Toglia and Battig’s estimates for 1,153 monosyllabic items.
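Validity checks of this kind reduce to correlating ratings over the words two corpora share. A hypothetical sketch (the dict-based structures and function names are our own, not the published norms' format):

```python
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def common_item_correlation(norms_a, norms_b):
    """Correlate two {word: imageability rating} dicts over their
    shared vocabulary, as in the criterion-validity comparisons."""
    shared = sorted(set(norms_a) & set(norms_b))
    return pearson([norms_a[w] for w in shared],
                   [norms_b[w] for w in shared])
```

The N reported for each comparison above is simply the size of the shared vocabulary for that pair of corpora.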

Although the speed of responding was not emphasized in our instructions, it was recorded. The average reaction time across items was 1,825.51 ms, with a standard deviation of 479.78 ms. These values are slightly longer and more variable than those reported by Cortese and Fugett (2004; M = 1,423.2 ms, SD = 208.5 ms), a difference probably due to the greater length of the words in the present set. Furthermore, we found a negative relationship between the imageability estimates and the reaction times, r = −.58, p < .001. That is, participants provided estimates more quickly for high-imageable words than for low-imageable words. This relationship falls between those reported by Cortese and Fugett (2004, r = −.47) and Paivio (1968, r = −.70).

Combined with the estimates from our lab for monosyllabic words (Cortese & Fugett, 2004), we have now compiled imageability estimates for 6,000 words. This corpus has many potential uses. For example, we are using the estimates for disyllabic words to determine whether imageability accounts for unique variance in performance for reading aloud, lexical decision, and recognition memory. This work extends that conducted on both monosyllabic words (e.g., Balota et al., 2004; Cortese & Khanna, 2007; Cortese et al., 2010) and multisyllabic words (e.g., Yap & Balota, 2009). For monosyllabic words, imageability accounted for unique variance in lexical decision (Cortese & Khanna, 2007) and was the strongest of eight predictor variables of recognition memory performance (as measured by hits minus false alarms; Cortese et al., 2010). In addition, since current functional neuroimaging techniques require a large number of trials, these norms (as well as the combined set) facilitate stimulus selection for neuroimaging studies (e.g., Graves et al., 2010) as well as other types of studies.

In general, the use of large databases such as the ELP has been popular in recent years. Our view is that collecting as much normative information as possible for the words in these large corpora will prove useful. For example, the Semantic Priming Project (SPP; Hutchison et al., 2011) has amassed a corpus of reading-aloud and lexical-decision reaction times and accuracy measures for 3,000 targets in related and unrelated conditions. Previous work has demonstrated that the item characteristics of the prime and target account for unique variance in priming (see, e.g., Hutchison, Balota, Cortese, & Watson, 2008). In addition, Bleasdale (1987) found significant priming in a naming task when the primes and targets were homogeneous with regard to concreteness (e.g., chin–nose) or abstractness (e.g., truth–false), but not when they were heterogeneous (e.g., body–soul). As imageability correlates highly with concreteness, the present set of norms allows for these types of relationships to be further explored in the SPP and/or used for stimulus selection in future studies.

In summary, using procedures similar to those of Cortese and Fugett (2004), we have collected imageability estimates for 3,000 disyllabic words. Based on the correlation between our ratings and Toglia and Battig’s (1978) ratings, these estimates appear to be valid. We expect these estimates to be of use to researchers interested in the relationships between imageability and other variables.