Word frequency effects are classic and robust in psycholinguistic research, predicting performance across many tasks, including lexical decision (Rubenstein, Garfield, & Millikan, 1970; Stone & Van Orden, 1993), naming (Forster & Chambers, 1973; Waters & Seidenberg, 1985), and perceptual identification (Manelis, 1977). Even in silent reading, word frequency has powerful effects on eye movements and fixation durations (Inhoff & Rayner, 1986; Rayner & Duffy, 1986). Despite the ubiquity of frequency effects, questions remain about their locus in word processing: Are frequency effects restricted to early processes of perception and lexical access, or do they continue into postaccess processes? A related question concerns the automaticity of lexical access: Does word perception require central attention, and might cognitive demands differ across high-frequency (HF) and low-frequency (LF) words?

The word frequency effect is most commonly assessed using response times (RTs); across variations in methods, stimuli, and participants, people process HF words faster than LF words. Critically, however, frequency interacts with many other variables, such as context (Becker & Killion, 1977), stimulus quality (Yap, Balota, Tse, & Besner, 2008), word repetition (Scarborough, Cortese, & Scarborough, 1977), and neighborhood density (Andrews, 1992; Grainger & Segui, 1990). Frequency effects also change across tasks; for example, they are approximately three times larger in lexical decision, relative to naming. As Balota and Chumbley (1984) argued, this interaction suggests that the decision stage (a necessary component for lexical decision, but not for naming) is frequency-sensitive. Thus, frequency appears to affect processes beyond perception in lexical decision. Indeed, Balota and Abrams (1995) later reported that frequency influences response mechanics (e.g., arm movement force) in lexical decision. In contrast, the naming task does not entail a decision stage, and is often assumed to provide a “cleaner” estimate of frequency (and other lexical) effects.

The delayed naming task

In a standard naming task, participants try to correctly name stimulus words as quickly as possible following their presentation (e.g., Forster & Chambers, 1973). Given this procedure, however, an RT difference between LF and HF words might reflect lexical access, or it might reflect uncontrolled differences in articulatory complexity (or other factors) across the words. To address this issue, the delayed naming task was developed. In delayed naming, the participant sees a word and then waits, often for variable intervals, for a “go” signal to initiate the naming response. The logic is simple: Because lexical access is allowed to finish, any remaining frequency (or other) effect should arise from speech planning or execution. As reviewed by Goldinger, Azuma, Abramson, and Jain (1997), frequency effects have been inconsistently reported in delayed naming. Indeed, the procedure was originally used as a control task, to ensure that frequency effects in naming times did not unduly reflect phonetic variations across items (Andrews, 1989; Forster & Chambers, 1973). Conversely, in a well-known study, Balota and Chumbley (1985) had participants view HF and LF words, wait for response cues at various delays, and then quickly speak. Balota and Chumbley observed reliable frequency effects following relatively long delays (650 ms or longer), suggesting that frequency affected postaccess processes. These findings have been replicated and extended (Connine, Mullennix, Shernoff, & Yelen, 1990), but have also been questioned (McRae, Jared, & Seidenberg, 1990). For example, whenever participants must produce different words, there remains a potential for uncontrolled phonetic variations, leading to differences in voice-key activation.

To address the issue of interword phonetic variations, Goldinger et al. (1997) modified the delayed naming task not only to wait out lexical access, but also to equate production, leaving only the speech-planning process free to vary. To accomplish this purpose, they used a dual task/change procedure. Most trials (80%) entailed “standard” delayed naming, in which participants saw words, waited for variable delays, and spoke the words in response to a high-pitched tone. In the remaining, catch trials, a low-pitched tone indicated that participants should abandon the speech plan, quickly saying “blah” instead. Using delays ranging from 150 to 1,400 ms, Goldinger et al. (1997) observed robust frequency effects in both standard and catch trials, with slower responses following LF words. That is, people were slower to produce LF words, as well as slower to switch away from LF words to the “blah” response. The results suggested that word frequency affects speech planning, independent of speech production, with LF words demanding extra attention. Still, other researchers have employed the delayed naming procedure and have not observed any reliable frequency effects (McCann & Besner, 1987; McRae et al., 1990; Savage, Bradley, & Forster, 1990). In general, frequency effects in delayed naming are fragile, appearing and disappearing with changes to the stimuli and methods. As such, it is difficult to conclude that word frequency truly affects lexical processing beyond the early stages of perception.

Pupillometry

Given such fragile effects in delayed naming, one goal of the present study was to widen our focus, expanding from relying on individual RTs to examining a more continuous estimate of cognitive effort. Our specific approach was to collect continuous estimates of pupil dilation while participants performed a modified delayed naming task, as in Goldinger et al. (1997). This approach allowed for the assessment of potential frequency effects in delayed naming, independent of the presence or absence of an RT effect. Although the pupils change reflexively in response to general factors, such as emotional arousal, stress, and anxiety, such tonic changes are independent of phasic changes, which are based on the onset of stimuli for cognitive processing (Karatekin, Couperus, & Marcus, 2004). Phasic changes in pupil diameter have been used to infer cognitive effort across many domains, including lexical decision (Kuchinke, Võ, Hofmann, & Jacobs, 2007), attention allocation (Karatekin et al., 2004), working memory load (Granholm, Asarnow, Sarkin, & Dykes, 1996; Van Gerven, Paas, Van Merriënboer, & Schmidt, 2004), face perception (Goldinger, He, & Papesh, 2009), and general cognitive processing (Granholm & Verney, 2004).

The appeal of pupillometry for the investigation of cognition lies in its relative independence: Pupil dilation and constriction occur automatically, so cognitive effort can be gauged without influences from task-specific strategies. Prior research has shown that, when people perform more difficult cognitive operations, their pupils dilate (Porter, Troscianko, & Gilchrist, 2007), which may represent the summed index of the required brain activity (Beatty, 1982). Task-evoked peak pupil diameters have been used to index the cognitive load associated with memory tasks (Kahneman & Beatty, 1966), showing that pupils dilate relatively quickly with the onset of mental effort and constrict relatively quickly after its cessation. Indeed, Kahneman (1973) used the pupillary reflex as his primary index of mental processing load in his theory of attention, citing its sensitivity to variations within or between tasks and its ability to reflect individual differences in cognitive ability. Relevant to the present study, Kuchinke et al. (2007) recorded pupil dilation during a lexical decision task, using HF and LF words of varying emotional valences. Average peak pupil dilation in response to LF words significantly exceeded peak dilation to HF words (approximately 1,200 ms after stimulus onset), suggesting that pupil dilation can indirectly reflect the cognitive effort devoted to word processing. Given their use of lexical decision, however, the role of the decision stage remains unclear. A central aim of the present study was to assess “hidden” frequency effects in lexical processing without requiring lexical decisions.

When using pupillometry, particular caution must be exercised in stimulus generation and development of the procedural details. The pupils dilate reflexively to changes in the luminance, color, and spatial frequencies of visual input. Therefore, care must be taken to equate, as closely as possible, the stimulus characteristics across conditions. Porter and Troscianko (2003) identified several methods to minimize unwanted pupillary reflexes, including the use of relatively low stimulus contrast, avoiding colored stimuli, and using relatively long stimulus exposure durations. To address these considerations in the present investigation, we maintained constant background lighting and a constant screen color and (of greatest importance) equated the visual features of our stimulus words as closely as possible. Although we presented words briefly (500 ms), we examined extended time courses for pupillary reflexes, by utilizing response tone delays up to 2,000 ms and extending trials up to 6,000 ms postresponse. Examining an extended time course is critical, because pupillary reflexes appear several hundred milliseconds after the onset of stimulus processing (see Kuchinke et al., 2007). It is important to note, however, that pupillary responses to cognitive activity occur within approximately 250 ms (Beatty, 1982) and can be used to examine activities (such as reading; Heitz, Schrock, Payne, & Engle, 2008) that extend continuously in time.

In the present research, we adopted the modified, dual-task delayed naming paradigm used by Goldinger et al. (1997) and continuously monitored participants’ pupil diameters to estimate cognitive effort during speech planning. After reading briefly presented words, the participants waited for a response tone to indicate one of two responses, with different tones signaling either a standard naming response or a catch response (i.e., “blah”).Footnote 1 Response tones occurred following one of four randomly selected delays, ranging from relatively short (250 ms) to relatively long (2,000 ms). Given the nature of phasic pupillary reflexes, we expected to observe diverging pupil diameters for LF and HF words across longer delays. Specifically, if word frequency influences speech planning, preparing to speak LF words should demand greater cognitive resources, reflected by relatively enlarged pupils, following the offset of the printed word. Furthermore, this result should persist even when participants eventually execute a catch response, in which HF and LF words are phonetically equal. That is, we expected pupil dilation to reflect the greater effort required to abandon the speech plans for LF words. Although results in delayed naming RTs have been equivocal, we expected to observe a physiological manifestation of frequency effects by examining pupil diameters across standard and catch trials. In theoretical terms, such a finding would indicate that speech planning requires central attention, and that its demands vary across words in a frequency-sensitive manner. To foreshadow our results, we observed a frequency effect in standard delayed-naming RTs, but not in catch-trial RTs. However, we did observe clear differences in the allocation of cognitive effort across HF and LF words, which persisted for several hundred milliseconds after naming responses.

Method

Participants

A group of 43 native English speakers (18–30 years of age, M = 19.50, SD = 2.2) from Arizona State University participated for partial course credit. All of the participants reported normal or corrected-to-normal vision and no hearing difficulty. Two of the participants were dropped from the analysis because of missing pupil data (greater than 7% of their data).

Stimuli

We compiled a list of 150 word pairs, matched for visual features, length, and syllables, but differing in rated word frequency (see Table 1 and Appendix A). To match the visual features, we selected word pairs that differed by one letter (e.g., few/pew) or the order of letters (e.g., great/grate), selecting homophones whenever possible. The LF and HF word sets were closely matched for phonetic onsets, with a slight bias in favor of LF words (i.e., more initial stops). The HF words featured 88 initial stops, 43 initial fricatives, and 19 initial liquids/glides, and the LF words featured 92 initial stops, 38 initial fricatives, and 19 initial liquids/glides. The words were pseudorandomly assigned to five lists, such that each list contained 60 words (48 standard naming and 12 catch), which were presented to participants in random order. Across participants, all words were used in catch trials equally often. Response tones were generated using NCH Tone Generation software. Of the participants, 22 discriminated between tones of 900 and 400 Hz, and the other 21 discriminated between tones of 750 and 550 Hz.Footnote 2
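The list construction described above can be illustrated with a minimal sketch. It assumes a simple rotation scheme in which each of five hypothetical list versions marks a different 12-word slice of every list as catch items; the text states only that all words served as catch items equally often, so the rotation, the function name build_lists, and the placeholder word labels are illustrative assumptions rather than the actual procedure.

```python
import random

def build_lists(words, n_lists=5, catch_per_list=12, rotation=0, seed=0):
    """Split words into n_lists equal lists and mark catch items.

    rotation selects which slice of each list is designated as catch, so
    that across all rotations every word is a catch item exactly once.
    The rotation scheme is an assumption made for illustration.
    """
    rng = random.Random(seed)           # fixed seed: same lists in every rotation
    shuffled = list(words)
    rng.shuffle(shuffled)
    size = len(shuffled) // n_lists     # 300 words / 5 lists = 60 per list
    lists = []
    for i in range(n_lists):
        chunk = shuffled[i * size:(i + 1) * size]
        start = rotation * catch_per_list
        catch = set(chunk[start:start + catch_per_list])
        trials = [(w, "catch" if w in catch else "standard") for w in chunk]
        # Randomize presentation order separately for each list and rotation.
        random.Random(seed + 1000 * rotation + i).shuffle(trials)
        lists.append(trials)
    return lists

# Placeholder labels standing in for the 300 stimulus words (150 pairs).
words = [f"w{i:03d}" for i in range(300)]
version_0 = build_lists(words, rotation=0)
```

With 60 words per list and 12 catch items, five rotations cover every word once, which reproduces the 48/12 split per list and the 240/60 split per session.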

Table 1 Summary statistics for the stimulus set

Procedure

Participants were tested individually in a dimly lit, sound-attenuated booth. A chinrest maintained head position and viewing distance at 60 cm. The stimuli were presented in lowercase, 24-pt black Arial font on a constant gray background (RGB 150) at 1,024 × 768 resolution on a 17-in. monitor. Eye movements and pupil diameter were continuously recorded binocularly at 50 Hz, using a Tobii 1750 eyetracker. Naming latencies were recorded by an SR response box with a voice key. The experiment was controlled and data were collected using E-Prime 1.1 software (Psychology Software Tools, 2006).

Participants were first familiarized with the experiment and the eyetracker. The chinrest was adjusted such that eye position was maintained centrally on the horizontal axis and slightly above center on the vertical axis, and then participants were calibrated. For the calibration routine, we randomly presented nine fixation points (indicated by a blue dot) over the range of the display; participants “followed the dot” as it moved to each location. If the software or researcher identified any missing fixations, the calibration routine was repeated. All of the participants were successfully calibrated within two attempts.

The experiment began with a tone identification task, wherein participants judged the response tones as being either high- or low-pitched by pressing a corresponding response key. All of the participants completed 6 trials with 100% accuracy. After reading the instructions and having the task verbally explained, participants completed 6 practice trials (half standard and half catch). If the participant committed an error on any trial, the researcher reminded them of the instructions before proceeding with the experimental trials. Each experimental trial proceeded as outlined in Fig. 1, with 240 standard naming trials and 60 catch trials. Following the offset of a 1,500-ms fixation cross, participants were shown a single word for 500 ms. After the offset of the word, participants waited over one of four delay periods (250, 500, 1,000, or 2,000 ms) before a pseudorandomly selected high- or low-pitched tone indicated that a verbal response should be made. High-pitched tones signaled standard naming trials; low-pitched tones signaled catch trials. The tones were selected to maintain the ratio of standard to catch trials used by Goldinger et al. (1997); only 20% of the trials were catch trials, a percentage meant to encourage standard speech planning during the delay period. Following their verbal responses, participants were given at least 4,000 ms (but up to 6,000 ms) for their pupils to constrict to a preestablished baseline range.Footnote 3 Short breaks were permitted after every 60 trials.

Fig. 1 Schematic outline of a single experimental trial

Results

Trials with errors (i.e., mispronunciations, pretone responses, and voice key errors) were removed from the analysis.Footnote 4 Missing pupil data (due to blinks) were filled in by linear interpolation, and peak diameters were averaged during specific trial events (fixation, word presentation, wait, tone, response preparation, response, and postresponse intertrial interval [ITI]). The missing pupil data were not systematically distributed across experimental variables, and none of the 41 participants whose data were retained had more than 6% missing data. Baseline-corrected peak pupil diameters were computed, on a trial-by-trial basis, as the difference between the average of the participant’s pupil diameter during the fixation cross and the event peak of the current trial. This procedure minimized the influence of “carryover” effects from the word frequency or difficulty of the preceding trial. All pupil analyses were conducted on baseline-corrected data from each participant’s right eye. For all analyses, alpha was maintained at .05, and multiple comparisons were corrected with Bonferroni adjustments.
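As a rough illustration of the preprocessing just described, the following sketch fills blink gaps by linear interpolation and then computes a baseline-corrected peak for one trial event. The function names, the use of None to mark missing samples, and the handling of gaps at the edges of a trial are assumptions made for the example, not details taken from the analysis itself.

```python
def interpolate_blinks(samples):
    """Fill None gaps by linear interpolation between neighboring valid samples.

    Gaps at the very start or end of the trace are filled with the nearest
    valid value (an assumption; the text does not describe edge handling).
    """
    out = list(samples)
    valid = [i for i, v in enumerate(out) if v is not None]
    if not valid:
        raise ValueError("trace contains no valid samples")
    for i in range(valid[0]):
        out[i] = out[valid[0]]
    for i in range(valid[-1] + 1, len(out)):
        out[i] = out[valid[-1]]
    for a, b in zip(valid, valid[1:]):
        gap = b - a
        for k in range(1, gap):
            out[a + k] = out[a] + (out[b] - out[a]) * k / gap
    return out

def baseline_corrected_peak(trace, fixation_slice, event_slice):
    """Peak diameter within an event minus the mean diameter during fixation."""
    clean = interpolate_blinks(trace)
    fixation = clean[fixation_slice]
    baseline = sum(fixation) / len(fixation)
    return max(clean[event_slice]) - baseline

# Toy trace in millimeters; at 50 Hz each sample would span 20 ms.
trace = [3.0, None, 3.2, 3.2, None, None, 3.8, 3.5]
peak = baseline_corrected_peak(trace, slice(0, 3), slice(3, 8))
```

Because the baseline is recomputed from the fixation period of each trial, any diameter difference carried over from the preceding trial is subtracted out before peaks are compared.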

Naming latency

A total of 5 participants with excessive naming errors (greater than 10% of trials) were excluded on a case-wise basis, leaving 36 in the analysis. The naming RTs for all trials are summarized in Table 2. As shown, we observed robust effects of delay (primarily in the catch trials), but the effects of word frequency were inconsistent. These data were first analyzed in an omnibus 2 (word frequency: HF or LF) × 2 (trial type: standard or catch) × 4 (delay: 250, 500, 1,000, or 2,000 ms) within-subjects repeated measures (RM) ANOVA. Although a main effect of delay, F(3, 33) = 55.18, p < .001, ηp² = .83, indicated that participants responded faster as the delay increased, an interaction between delay and trial type, F(3, 33) = 45.25, p < .001, ηp² = .80, revealed that this pattern was only evident during catch trials. Standard trials (784 ms) were faster than catch trials (1,169 ms). This main effect of trial type, F(1, 35) = 146.63, p < .01, ηp² = .81, reflected the difficulty associated with task switching. In the omnibus analysis, we did not observe a reliable effect of word frequency (F < 1); the average naming latency for HF words (979 ms) was equivalent to that for LF words (974 ms).

Table 2 Average naming response times (in milliseconds) in all conditions

Following the omnibus analysis, we conducted separate 2 × 4 (Word Frequency × Delay) RM ANOVAs on the standard trials and the catch trials. In the standard trials, we observed a reliable, 17-ms effect of word frequency, F(1, 35) = 7.48, p < .01, ηp² = .18. The main effect of delay was marginal, F(3, 105) = 2.46, p = .07, ηp² = .06, reflecting the small (20-ms) decrease in average naming RTs across delays. The interaction was not reliable (F < 1). In the catch trials, the main effect of word frequency was not reliable (F < 1), but there was a robust effect of delay, F(3, 105) = 67.92, p < .01, ηp² = .66, reflecting the large (964-ms) decrease in RTs across delays. The interaction was not reliable (F < 1). Taken together, the naming latencies partially replicated the prior results from Goldinger et al. (1997), although two effects (delay in standard trials and frequency in catch trials) were weaker than had been previously reported. We consider these results in the Discussion section, after examining the pupillary results.
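For readers who wish to relate the reported effect sizes to the F ratios, partial eta squared can be recovered from an F value and its degrees of freedom through the standard identity: F times the effect df, divided by the sum of that product and the error df. The short sketch below applies that identity to several of the values reported above; the function name is illustrative.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from an F ratio and its degrees of freedom.

    Uses the identity SS_effect / (SS_effect + SS_error)
    = (F * df_effect) / (F * df_effect + df_error).
    """
    numerator = f_value * df_effect
    return numerator / (numerator + df_error)

# Values taken from the naming latency analyses.
omnibus_delay = round(partial_eta_squared(55.18, 3, 33), 2)       # 0.83
standard_frequency = round(partial_eta_squared(7.48, 1, 35), 2)   # 0.18
catch_delay = round(partial_eta_squared(67.92, 3, 105), 2)        # 0.66
```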

Pupil dilation

Baseline-corrected peak pupil diameters were analyzed in separate 2 (word frequency) × 2 (trial type) × 4 (delay) within-subjects RM ANOVAs for each of the seven trial events. That is, pupil diameters were not examined in veridical “real time,” but in time-invariant sequences of trial events. For example, the “postresponse” trial period in the 250-ms delay condition occurred in real time at approximately 2,100 ms; this was the real-time equivalent of the “wait” trial period in the 2,000-ms delay condition. Because of such timing differences, pupillary responses were analyzed according to trial events. (Note, however, that the same results were observed in a set of real-time analyses.) Complementary analyses were carried out for each delay condition separately, on baseline-corrected, event-locked peak pupil diameters. For ease of presentation, these results are discussed in order of the trial events, as peak millimeter differences from baseline (mmd). The full ANOVA results can be found in Appendix B; only results of theoretical interest are discussed in the text.
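The distinction between real time and trial events can be made concrete with a small sketch that maps a time point onto the event in progress at that moment. Several simplifications are assumptions of the example: time is measured from word onset, the brief tone event is folded into response preparation, and the duration of the spoken response (speech_ms) is a hypothetical value chosen only for illustration.

```python
WORD_MS = 500  # word exposure duration, from the Procedure

def trial_event_at(t_ms, delay_ms, rt_ms, speech_ms=400):
    """Return the trial event in progress at t_ms (measured from word onset).

    rt_ms is the latency from tone onset to the voice-key trigger.
    speech_ms is an assumed (hypothetical) duration of the spoken response.
    """
    tone_onset = WORD_MS + delay_ms
    response_onset = tone_onset + rt_ms
    response_offset = response_onset + speech_ms
    if t_ms < 0:
        return "fixation"
    if t_ms < WORD_MS:
        return "word"
    if t_ms < tone_onset:
        return "wait"
    if t_ms < response_onset:
        return "response preparation"
    if t_ms < response_offset:
        return "response"
    return "postresponse"

# The same real-time point falls in different events across delay conditions.
short_delay = trial_event_at(2100, delay_ms=250, rt_ms=784)   # "postresponse"
long_delay = trial_event_at(2100, delay_ms=2000, rt_ms=784)   # "wait"
```

Using the mean standard-trial RT of 784 ms, the 2,100-ms point falls after the response in the 250-ms delay condition but before the tone in the 2,000-ms condition, which is why the analyses were locked to events rather than to clock time.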

The relatively late-arriving characteristic of the pupillary response suggests that reliable effects should occur at least 200–300 ms after a relevant cognitive event (Beatty, 1982). Thus, we do not address analyses on the fixation and word trial events (indeed, there was no effect of word frequency while words were on screen, p = .47). After word offset, participants waited for variable delay periods before hearing the response tone. During this wait period, we observed a main effect of trial type: Although participants had no way of knowing what trial was coming, their peak diameters were reliably larger during standard (0.17 mmd) than during catch (0.14 mmd) trials. We suspect that this effect is spurious, driven primarily by having relatively few catch trials. We also observed a strong main effect of delay during the wait period, and a marginal effect of word frequency. As is shown in Fig. 2, as the delay period grew longer, participants’ peak pupil diameters increased, with a general trend toward larger pupil diameters in trials with LF words.

Fig. 2 Average baseline-corrected peak pupil diameters during the variable wait period, collapsed across trial types. Error bars represent standard errors. * p < .05

After hearing the response tone, participants either initiated execution of the predominant naming response or switched to prepare (and execute) the alternative “blah” response. During this response preparation period, we observed main effects of delay and word frequency. As was evident during the wait period, longer delays produced greater peak dilation. Consistent with our predictions and previous findings (Kuchinke et al., 2007), the main effect of word frequency revealed that LF words elicited greater dilation (0.38 mmd) than did HF words (0.34 mmd). Consistent with the notion of cognitive effort, catch trials placed a greater demand on participants, relative to standard trials. Although this trend was consistent across all delays, the Delay × Trial Type interaction indicated that the difference was only reliable in the 250- and 500-ms delay conditions: Preparing a “blah” response resulted in enlarged pupils at short delays. As is shown in Fig. 3, the difference between LF and HF words was numerically larger when participants planned a catch trial response, suggesting that switching away from the predominant speech plan was more resource-demanding for LF words (0.35 mmd) than for HF words (0.33 mmd).

Fig. 3 Baseline-corrected peak pupil diameters for high-frequency (HF) and low-frequency (LF) words during each trial period for standard (top panel) and catch (bottom panel) trials, collapsed across delay conditions. Error bars represent standard errors. * p < .05. ** p < .01

When participants correctly issued verbal responses, we again observed several reliable patterns. There was a frequency effect: Issuing an LF response (either standard or catch) demanded greater effort (0.25 mmd), relative to emitting an HF response (0.18 mmd). Although it was present in both standard and catch trials, the frequency effect was stronger in catch trials (ηp² = .62) than in standard trials (ηp² = .14). Actually producing the catch response, however, demanded fewer resources (0.16 mmd), relative to standard naming (0.27 mmd). We interpret this finding as a speech–motor benefit for repeated items. On average, participants successfully issued more than 50 “blah” responses per session. Over time, the associated motor program may have become well-practiced, demanding fewer resources.

After participants issued their spoken responses, they waited for variable ITIs (up to 6,000 ms) before the start of the next trial. During the postresponse period, we again observed a reliable main effect of word frequency, suggesting that the physiological reflection of the effort devoted to LF words levels off slowly, relative to the HF words: Whereas pupil diameters following HF words began to quickly constrict during this period (0.45 mmd), diameters following LF words were still more dilated (0.49 mmd). The main effect of delay was again evident, but was only reliable in catch trials.

Considering this postresponse period, one aspect of the results in Fig. 3 requires further explanation. Specifically, it appears that pupil diameters increased in the postresponse period, relative to the moment when participants actually spoke. As was shown by Kahneman and Beatty (1966), pupils typically constrict immediately upon the cessation of a cognitive load. The apparent increase shown in Fig. 3 is actually an artifact of the naming procedure. In a naming task, RTs are recorded from the moment a voice key is triggered and are typically the only dependent measure. In the present study, we continued collecting pupil data throughout the spoken response, when effort was still being expended (movement also affects the pupil reflex; see Richer & Beatty, 1985). Because Fig. 3 represents peak dilation across trial events, each value represents the average maximum, rather than the true average. To better illustrate the underlying pupil waveforms, Fig. 4 shows average pupil diameters for all standard and catch trials in the 500-ms delay condition. For clarity, the data are shown as raw values in millimeters, not corrected for baseline differences. As the figure makes evident, average pupil diameters were generally higher for LF words throughout all trial periods, with the peak typically corresponding to the spoken response in standard trials, and to the “switch” decision in catch trials. The upward deflection following the tone onset was also sharper for catch relative to standard trials, indicative of task-switching effort. When they are viewed as direct averages, it is evident that pupil diameters indeed returned toward baseline once responses were issued.

Fig. 4 Average pupil diameters (in millimeters) for high-frequency (HF) and low-frequency (LF) words from all trials with a 500-ms delay. The left panel shows results from standard trials, and the right panel shows catch trials. In both panels, the dots reflect average naming response times, with values shown in the inset boxes. The trial periods corresponding to word presentation and tone onset are indicated by vertical lines

Discussion

In the present study, we assessed the relative cognitive demands of naming LF and HF words. By coupling a modified delayed naming procedure with pupillometry, we examined both standard (i.e., naming latencies) and novel (i.e., pupillary reflexes) manifestations of the word frequency effect. In the majority (80%) of experimental trials, participants issued standard delayed naming responses, following the perception of HF or LF words. In the remaining trials, they had to abandon the speech plan, saying “blah” instead, thereby equating speech production across HF and LF words. Our results generally replicated and extended those from Goldinger et al. (1997), although some differences were observed. Considering first the similar findings, we verified that standard naming is faster than switching to the “blah” response. Second, across all delay periods, we observed slower standard naming latencies for LF than for HF words. Although we did not observe frequency effects in the catch trials, the pupillary results support and extend those from Goldinger et al. (1997): Using pupil dilation as a proxy for cognitive effort (Kahneman, 1973), we observed that preparing to speak LF words, relative to HF words, demanded greater cognitive resources, a difference that persisted even after speech programs had been executed. The results suggest that lexical processes can be examined by monitoring pupillary reflexes (Kuchinke et al., 2007), potentially revealing differences in cognitive demands, even in cases with equivalent overt performance (Karatekin et al., 2004).

Although the pupillary results support earlier findings, it is important to note that two results from Goldinger et al. (1997) were not replicated in the present RT data. First, in the standard naming trials, the effect of delay was small. Whereas responses typically become faster with longer naming delays (Balota & Chumbley, 1985; Goldinger et al., 1997; Kawamoto, Liu, Mura, & Sanchez, 2008), we observed only a 20-ms effect. In contrast, the catch trials showed a robust delay effect. Together, these results suggest that participants may have been anticipating switch (“blah”) trials to an unreasonable degree. In a result of greater relevance to the present article, although we observed a reliable (albeit small) frequency effect in the standard delayed naming trials, we found no such effect in the catch trials, and thus failed to replicate this result from Goldinger et al. (1997).

We have two main hypotheses regarding the differences across the studies. Our primary hypothesis is that the present stimulus items were chosen in a manner that maximally preserved visual features across matched LF and HF words, but also minimized phonetic differences. Many pairs contained similar sound patterns (e.g., drawn/drown, mouse/moose, sweet/sweep), and some were fully homophonic (e.g., course/coarse, pain/pane, read/reed). We selected items with strict criteria in order to avoid unwanted pupillary reflexes, and we verified (in a pilot study) that they produced a reliable frequency effect in standard naming. Although the executed motor plan was always the same in catch trials, the initial speech–motor plan (i.e., speaking the printed word) was left free to vary, but perhaps not enough to elicit a difference in task-switching time. Once word perception was complete in delayed naming, the speech–motor programs may have been too similar across LF and HF words to reliably elicit a difference.

Our second hypothesis regards the pace of the experiment—in particular, its overall slow RTs. In the present experiment, the average RT in standard delayed naming was approximately 780 ms; analogous RTs in Goldinger et al.’s (1997) study were approximately 600 ms. Once again, this difference may reflect the phonetic composition of the stimulus materials. The present experiment included many items with “soft” initial phonemes, which are well known to affect voice-key responses (Kawamoto et al., 2008; Kessler, Treiman, & Mullennix, 2002). It is also possible that participants responded slowly because of the pace of the experiment. In order to allow pupils adequate time to return to baseline dilation, we introduced long ITIs (from 4 to 6 s) that may have slowed performance. The present experiment lacked the rhythm of most naming experiments, which has been shown to strongly affect RTs (see Kello, 2004). Our procedure also involved a forced choice response (standard/catch) after the response tone, whereas most delayed naming procedures merely involve naming. By introducing an extra task demand, we may have disrupted the normal rhythm of delayed naming, thereby slowing naming latencies and obscuring RT-based frequency effects. Whether either hypothesis is correct, it is important to emphasize that RTs represented only one dependent measure. Pupil dilation, measured in those same trials, revealed effects that conceptually replicated those from Goldinger et al. (1997), suggesting that the lower-frequency words imposed greater cognitive load.Footnote 5

As noted earlier, word frequency effects are ubiquitous in perceptual identification (Goldiamond & Hawkins, 1958; Manelis, 1977), lexical decision (Whaley, 1978), silent reading (Inhoff & Rayner, 1986; Rayner & Duffy, 1986), recognition memory (Glanzer & Adams, 1985), and other domains. Despite the theoretical importance of word frequency, debate surrounds its locus in the stream of cognitive processing. For example, although word frequency clearly affects perception (O’Malley, Reynolds, Stolz, & Besner, 2008; but see McCann & Besner, 1987; Paap & Johansen, 1994), interactions between word frequency and tasks (e.g., lexical decision vs. naming) suggest that the decision stage is also frequency-sensitive. To examine decision-free frequency effects, researchers may use the delayed naming procedure, wherein naming occurs once lexical access is complete. If word frequency exerts an influence solely during the perception and decision stages, it should disappear in delayed naming; however, if postaccess processes are frequency-sensitive, the differences should persist. We observed the latter result, suggesting that frequency modulates the cognitive effort required for speech planning (Goldinger et al., 1997; see also Herdman, 1992).

With the exception of event-related potential (Dambacher, Kliegl, Hofmann, & Jacobs, 2006; Hauk & Pulvermüller, 2004; Rugg, 1990) and fMRI (Chee, Westphal, Goh, Graham, & Song, 2003; Kronbichler et al., 2004) investigations, frequency effects are typically observed as overt behavioral differences across tasks, such as naming times or fixation durations. Although behavioral measures clearly identify frequency-based differences in lexical processing, they may mask dynamic attentional differences that persist after the offset of a behavioral response. In our study, we examined naming times but also estimated cognitive demands via pupillometry. As noted earlier, researchers have long examined the pupillary reflex to study cognitive processes, including lexical decision (Kuchinke et al., 2007), visual search (Porter et al., 2007), attention and concentration (Bradshaw, 1968), imagery (Paivio & Simpson, 1966), and memory (Papesh & Goldinger, 2011; Papesh, Goldinger, & Hout, 2012). In such studies, the pupils begin to dilate within 200–300 ms following the onset of a stimulus that requires cognitive processing. Constriction then follows the offset of cognitive demand, on a time course similar to that of dilation. For example, in a short-term memory task in which participants held digits in working memory prior to a “recall” signal, pupils reached maximum dilation during a preresponse pause, and then gradually constricted with each recalled digit (Kahneman & Beatty, 1966).

In the present study, participants’ pupil diameters increased sharply during the response preparation phase (the period directly following the response tone); this increase was consistently larger for LF than for HF words. Because this occurred in both standard and catch trials, it suggests that, regardless of the speech–motor plan, LF words demanded greater cognitive resources during processing. In fact, the frequency effect was numerically stronger during catch trials, wherein participants issued the same speech responses, regardless of the perceived word. Although the motor program for catch trials was repeated often throughout the experiment, thus producing smaller peak differences from baseline, such trials yielded large pupillary frequency effects: Planning to speak LF words, relative to HF words, demanded greater cognitive resources. (Notably, the numerically larger effect for catch trials may have been an artifact of having fewer observations, by design.) Furthermore, even after the speech–motor plan had been executed, LF words continued to exert an influence over the availability of cognitive resources. As can be seen in Figs. 3 and 4, we observed persistent frequency effects in the postresponse phase of each trial. This phase lasted at least 4,000 ms, which is sufficient time to allow the pupils to constrict following a naming response (Kahneman & Beatty, 1966). We thus observed a lasting effect of word frequency, continuing even as participants prepared for the subsequent trial. Indeed, as noted in Footnote 3, pilot testing for this experiment showed that the word frequency of trial n significantly influenced the starting diameter during trial n + 1, forcing us to use a trial-by-trial baseline correction procedure on our pupil data. (Note that, in the right-hand panel of Fig. 4, the uncorrected functions show a frequency-based diameter difference prior to word onset.) One possible interpretation is that LF words demand more cognitive resources during processing, leaving a diminished “pool” of available resources after they are processed. Although this hypothesis requires further testing, it suggests that higher-level cognitive operations, such as reasoning or text comprehension, may suffer persistent lags when a reader encounters LF words with minimal supporting context.
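
To illustrate why a trial-by-trial baseline correction matters when the preceding trial's frequency shifts the starting diameter, the following is a minimal sketch, not the procedure used in the present study. It assumes a simple subtractive correction against the mean of a pre-onset window; the function name `baseline_correct`, the window length, and the diameter values (arbitrary integer units) are all hypothetical.

```python
from statistics import mean


def baseline_correct(trace, n_baseline):
    """Subtract the mean of the first n_baseline samples from every sample.

    Assumption: the first n_baseline samples were recorded before word onset.
    """
    if not 0 < n_baseline <= len(trace):
        raise ValueError("n_baseline must be between 1 and len(trace)")
    baseline = mean(trace[:n_baseline])
    return [sample - baseline for sample in trace]


# Hypothetical pupil traces in arbitrary units; two pre-onset samples each.
# The second trace starts higher, as might occur after an LF word on trial n.
after_hf = [400, 400, 430, 450]
after_lf = [420, 420, 450, 470]

corrected_hf = baseline_correct(after_hf, 2)
corrected_lf = baseline_correct(after_lf, 2)
```

In this constructed example the two raw traces differ only in their starting level, so the corrected traces are identical. Carryover from trial n would otherwise be misread as a task-evoked difference on trial n + 1.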

In summary, the present results suggest that word frequency affects postaccess processes, such that LF words demand greater attention in speech planning and beyond. By monitoring pupil diameters as participants perceived and prepared to speak words of varying frequencies, we observed that the cognitive demand imposed by LF words reliably exceeded that of HF words. This difference in demand occurred over an extended time course, including the postspeech “recovery” time. Current theories of frequency effects in word perception typically focus on discrete stages, from perception through behavioral output. Analogous to “spillover” effects in reading (Rayner & Duffy, 1986; Reichle, Pollatsek, & Rayner, 2006), our results suggest that frequency may continue to affect cognitive processing after perception, lexical access, and naming are complete. The results also suggest that pupillometry offers a reliable method to estimate the cognitive demand imposed by different variables in word recognition.