Reconsidering the role of temporal order in spoken word recognition
Models of spoken word recognition assume that words are represented as sequences of phonemes. We evaluated this assumption by examining phonemic anadromes, words that share the same phonemes but differ in their order (e.g., sub and bus). Using the visual-world paradigm, we found that listeners show more fixations to anadromes (e.g., sub when bus is the target) than to unrelated words (well) and to words that share the same vowel but not the same set of phonemes (sun). This contrasts with the predictions of existing models and suggests that words are not defined as strict sequences of phonemes.
Keywords: Spoken word recognition; Speech perception; Temporal order; Eyetracking; Visual-world paradigm
A significant problem in understanding spoken word recognition is that speech unfolds over time. This leads to temporary ambiguity: Information early in the signal is often insufficient to identify the intended word, since it is consistent with many different words (Marslen-Wilson, 1987). For example, when hearing tack, after only /tæ/ is heard, many completions are possible (e.g., tack, tap, taxi). A related issue involves temporal order: The order of the elements in a word seems to be important for distinguishing it from other words. For example, sub and bus consist of the same phonemes but can be distinguished because those phonemes occur in different sequences. There has been little research on the effects of temporal order, despite the assumption that it is fundamental to lexical representations, and the fact that it is implemented quite differently in different models (Gaskell & Marslen-Wilson, 1997; Grossberg, 2003; McClelland & Elman, 1986; Norris & McQueen, 2008). In the present study, we examined its role in spoken word recognition by asking whether phonemes in incorrect positions contribute to lexical access.
Previous studies on temporary ambiguity are relevant to this problem. These studies have led to a consensus that listeners access potential lexical candidates from the earliest moments of a word (Allopenna, Magnuson, & Tanenhaus, 1998; Marslen-Wilson & Zwitserlood, 1989); they consider multiple words in parallel (Luce & Pisoni, 1998; Marslen-Wilson, 1987); they update the words under consideration as subsequent information arrives (Dahan & Gaskell, 2007; Frauenfelder, Scholten, & Content, 2001); and words compete with each other for recognition (Dahan, Magnuson, Tanenhaus, & Hogan, 2001; Luce & Pisoni, 1998).
This work also makes the strong assumption that words are defined as phoneme sequences.¹ This seems intuitive: If phoneme order distinguishes words, it makes sense for the system to represent them in this way. The large body of research showing strong competition from onset-matching competitors (cohorts) provided some of the earliest evidence for this assumption. Since cohorts are more active than other competitors during early portions of the signal, this result appears to support slot-based representations in which early parts of the signal are matched to early parts of words in the lexicon. Indeed, the COHORT model (Marslen-Wilson, 1987) suggested that recognition was all-or-none: Words that did not match at onset did not compete for activation. However, empirical work showing competition from onset-mismatching words has challenged this assumption (Allopenna et al., 1998; Connine, Blasko, & Titone, 1993). As a consequence, current models allow activation to depend on the degree of match in each phoneme position, but continue to use a slot-based scheme (Gaskell & Marslen-Wilson, 1997; Luce, Goldinger, Auer, & Vitevitch, 2000; Luce & Pisoni, 1998; McClelland & Elman, 1986; Norris & McQueen, 2008). In NAM and Shortlist B, for example, graded activation is implemented by using phoneme confusion data to partially activate words that are phonologically similar within each slot. In TRACE, this is accomplished through the use of graded features for each phoneme. However, while current models allow for this graded activation, they all incorporate the idea that the input is matched to lexical items slot-by-slot.
In the present study, we challenged this assumption, asking whether words are even represented as sequences. This research was loosely inspired by work on visual word recognition suggesting that letter order may be only coarsely coded (Chambers, 1979; Grainger & Whitney, 2004). Nonwords with transposed letters (JUGDE) prime their original words (JUDGE), while the same-sized mismatch without transposition (JULPE) does not (Perea & Lupker, 2003); this also extends to nonadjacent transpositions (e.g., CANISO/CASINO: Perea & Lupker, 2004). This implies that letters in incorrect positions may still activate the correct target word. This does not mean that readers completely ignore letter order, and there are limitations to these effects (Guerrera & Forster, 2008; Hannagan, Dupoux, & Christophe, 2011). However, the effect suggests that printed words are not represented as strict letter sequences. While spoken and written words differ in both temporal demands and the nature of the input (the forms of letters are the same in all positions, while the forms of phonemes are not), this raises the possibility that order need not be fully represented.
Abandoning a slot-based approach for spoken words may appear to create problems (e.g., distinguishing sub from bus). However, given that phonemes vary acoustically with word position and that listeners are sensitive to fine-grained differences (McMurray, Tanenhaus, & Aslin, 2002), people may not need to represent words in a slot-based format: Sub and bus may be distinguished because word-initial /b/ is acoustically different from word-final /b/. Under such a system, listeners might show strong cohort effects, since when they hear a word-initial /b/, its acoustic properties map more strongly onto words with the word-initial allophone of /b/, and less well onto words with the word-final allophone. Critically, these effects would not arise from slot-based representations.
We tested this possibility by examining phonemic anadromes: words like sub and bus that contain the same phonemes in the opposite order. Since many models tolerate some mismatch at onset (Gaskell & Marslen-Wilson, 1997; Luce et al., 2000; Luce & Pisoni, 1998; McClelland & Elman, 1986; Norris & McQueen, 2008), they might predict activation for anadromes over unrelated words that share no phonological features (on the basis of the shared vowel). However, they predict that this should be the same as activation for nonanadromes that have a similar degree of mismatch (e.g., sun would compete with bus just as well as sub does). None predict competition from anadromes due to phonemes in incorrect positions (e.g., the /b/ in sub leading to activation for bus). We have confirmed that this is the case for TRACE (see the supplemental materials, section S1). Thus, existing models suggest that sub might compete weakly with bus because of the similarity of the words within each slot, since the matching vowels (in the second slot) could drive some activation (though it has never been empirically demonstrated that a single word-medial phoneme can drive competition). Given this, we must also compare anadromes (sub when bus is the target) to words having the same onset and vowel (sun).
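The contrast between these predictions can be made concrete with a toy calculation. The sketch below is not an implementation of TRACE, NAM, or Shortlist B; it assumes an all-or-none phoneme match with no graded features, and the transcriptions, the `LEXICON` dictionary, and both scoring functions are hypothetical names introduced only for illustration. It compares a slot-based score (matches in the same position) with an order-free score (shared phonemes regardless of position) when bus is the target.

```python
from collections import Counter

# Toy phonemic transcriptions (illustrative; not the study's coding scheme).
LEXICON = {
    "bus": ("b", "ʌ", "s"),
    "sub": ("s", "ʌ", "b"),
    "sun": ("s", "ʌ", "n"),
    "well": ("w", "ɛ", "l"),
}

def slot_match(target, candidate):
    """Count phonemes that match in the same position (slot-based)."""
    return sum(t == c for t, c in zip(target, candidate))

def bag_match(target, candidate):
    """Count phonemes shared by the two words, ignoring position."""
    shared = Counter(target) & Counter(candidate)
    return sum(shared.values())

target = LEXICON["bus"]
for word in ("sub", "sun", "well"):
    candidate = LEXICON[word]
    print(word, slot_match(target, candidate), bag_match(target, candidate))
```

Under these assumptions, the slot-based score does not separate sub from sun (only the vowel matches in position for each), whereas the order-free score favors sub. Note that even this crude order-free score gives sun some credit for its /s/, so the toy example is meant only to show the direction of the contrast, not its size.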
Subphonemic overlap among the consonants could also play a role. While sub and bus have little overlap, consider tack and cat. In TRACE, the features for /t/ are similar to those for /k/. In NAM and Shortlist B, the /t/ would be partially confusable with /k/. Therefore, any word with some confusability in each slot could compete. Thus, tap could compete with cat, and if /p/ and /k/ are equally distant from /t/, tap and tack could be equally active. Critically, this activation still derives from slot-based representations; phonemes or features in the wrong slot do not drive the effect. Thus, to establish anadrome activation, we would also need to look at word pairs containing initial consonants with minimal phonological overlap.
We used the visual-world paradigm (Allopenna et al., 1998; Tanenhaus, Spivey-Knowlton, Eberhard, & Sedivy, 1995) to ask whether anadromes are activated during lexical access, and if so, whether part of this activation can be attributed to phonemes whose positions in the input do not match the corresponding positions in the competitor word. Participants heard a target word in the presence of four pictures representing the target, possible lexical competitors, and unrelated words. Each item set contained a base word (e.g., sub), its anadrome (bus), a cohort (sun), and an unrelated word (well). Critically, on some trials, anadromes could be compared to words that overlapped only in their vowel (e.g., when bus was the target, differences in looks to sub and sun), allowing a test of effects of the vowel and of partially matching initial consonants. While ideally the word list would include only anadromes with minimal overlap among the bracketing consonants, sufficient picturable pairs were not available in English to construct a large list. Thus, our list consisted of words with similar initial consonants (e.g., tack/cat), which were useful for establishing activation for anadromes, and words that were minimally overlapping (sub/bus), which could confirm that effects were driven by phonemes in different positions.
A group of 30 University of Iowa undergraduates participated. The participants were native English speakers, reported normal or corrected-to-normal vision, provided informed consent, and received course credit or monetary compensation.
Participants heard spoken words and used a computer mouse to select a corresponding picture. The stimuli consisted of 16 sets of four words (see the Appendix). Each set contained a base word (e.g., sub), its anadrome (bus), a cohort (sun), and an unrelated word (well).
Example item set, by trial types
Auditory stimuli were recorded by a female talker in a sound-attenuated room. Recordings were made at 44.1 kHz on a Kay Elemetrics Computerized Speech Lab 4300B. Several tokens of each word were recorded, and the best exemplar was selected. The mean word duration was 457 ms (SD = 74 ms).
The visual stimuli were color drawings prepared using a standard procedure in the McMurray lab. For each word, a set of candidate clip-art images was obtained. These were viewed by focus groups of lab personnel to select the most prototypical image and to guide subsequent editing to make pictures representative and uniform (McMurray, Samelson, Lee, & Tomblin, 2010).
An SR Research EyeLink II eyetracker was calibrated using the standard nine-point procedure. On each trial, participants saw four 200 × 200 pixel pictures, one in each corner of a 19-in. CRT monitor. At the center, a small blue dot was displayed, which turned red after 750 ms. The participant then clicked on the dot and heard the auditory stimulus 100 ms later (via Sennheiser HD555 headphones). This ensured that both the mouse cursor and the participant’s gaze were centered when the auditory stimulus began. The trial ended when the participant clicked on the corresponding referent.
Every 45 trials, drift correction was performed, and participants were given the opportunity to take a break. Stimulus presentation and data collection were handled by the SR Research Experiment Builder package and by the EyeLink control software.
As in prior experiments (McMurray et al., 2010; McMurray et al., 2002), the eye-movement record was parsed into saccades and fixations by the EyeLink software. Adjacent saccades and fixations were combined into a “look” extending from saccade onset to fixation offset. The boundaries around the images were extended by 100 pixels. This maintained substantial space between ports (horizontal, 630 pixels; vertical, 374 pixels) while compensating for noise.
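The parsing step described above can be sketched roughly as follows. This is a minimal illustration, not the EyeLink software's algorithm or the lab's actual code: the event tuple layout, the screen coordinates in `regions`, and the function names are all assumptions made for the example. Only the two rules stated in the text are taken from the source: a look runs from saccade onset to the offset of the following fixation, and each picture's boundary is extended by 100 pixels.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    right: float
    bottom: float

    def padded(self, margin):
        """Return a copy of this rectangle grown by `margin` on every side."""
        return Rect(self.left - margin, self.top - margin,
                    self.right + margin, self.bottom + margin)

    def contains(self, x, y):
        return self.left <= x <= self.right and self.top <= y <= self.bottom

def events_to_looks(events):
    """Merge each saccade with the fixation that follows it.

    `events` is a time-ordered list of ("saccade", start_ms, end_ms) and
    ("fixation", start_ms, end_ms, x, y) tuples (a hypothetical layout).
    Returns (look_start_ms, look_end_ms, x, y) tuples.
    """
    looks = []
    saccade_start = None
    for event in events:
        if event[0] == "saccade":
            saccade_start = event[1]
        else:
            _, start, end, x, y = event
            look_start = start if saccade_start is None else saccade_start
            looks.append((look_start, end, x, y))
            saccade_start = None
    return looks

def object_at(x, y, regions, margin=100):
    """Name of the picture whose padded boundary contains (x, y), if any."""
    for name, rect in regions.items():
        if rect.padded(margin).contains(x, y):
            return name
    return None

# Hypothetical picture locations on an assumed 1280 x 1024 display.
regions = {"target": Rect(0, 0, 200, 200), "anadrome": Rect(1080, 0, 1280, 200)}
events = [("saccade", 250, 290), ("fixation", 290, 520, 260, 150)]
looks = events_to_looks(events)
print(looks, object_at(260, 150, regions))
```

In this example the fixation lands 60 pixels outside the picture itself but inside the extended boundary, so it is still scored as a look to that picture.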
We used a standard area-under-the-curve approach for the data analysis (Allopenna et al., 1998; McMurray et al., 2002), with the time window starting 200 ms after stimulus onset (since it takes ~200 ms to plan an eye movement) and ending at 1,417 ms (the average response time [RT] plus 200 ms).² Proportions of looks were analyzed with linear mixed-effects (LME) models using the lme4 package in R (Bates & Sarkar, 2011). The dependent variable was proportion of looks, transformed with the empirical logit for use with linear models. All models used object type as the only fixed effect, which was binary and coded as −.5/+.5.
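As a rough illustration of how such a dependent variable might be computed, the sketch below counts the samples within the 200 to 1,417 ms window that fall on a given object and applies one common form of the empirical logit, ln[(y + 0.5) / (n − y + 0.5)]. The text does not state which variant of the transform was used, so this form is an assumption, as are the 4-ms sampling step, the look tuple layout, and the assignment of +.5 and −.5 to particular object types.

```python
import math

# Assumed contrast coding for the binary object-type predictor; which level
# receives +0.5 is an illustrative choice, not stated in the text.
OBJECT_TYPE_CODES = {"competitor": 0.5, "unrelated": -0.5}

def count_samples(looks, obj, start_ms=200, end_ms=1417, step_ms=4):
    """Return (samples on `obj`, total samples) within [start_ms, end_ms).

    `looks` is a list of (look_start_ms, look_end_ms, object_name) tuples;
    the 4-ms step is a hypothetical sampling interval.
    """
    times = range(start_ms, end_ms, step_ms)
    on_obj = sum(
        any(s <= t < e and o == obj for s, e, o in looks) for t in times
    )
    return on_obj, len(times)

def empirical_logit(n_on_object, n_total):
    """Empirical logit: 0.5 is added to both counts so that proportions of
    exactly 0 or 1 still yield finite values."""
    if not 0 <= n_on_object <= n_total:
        raise ValueError("n_on_object must lie between 0 and n_total")
    return math.log((n_on_object + 0.5) / (n_total - n_on_object + 0.5))

# Hypothetical trial: gaze is off every object until 600 ms, then on the anadrome.
looks = [(0, 600, None), (600, 1417, "anadrome")]
y, n = count_samples(looks, "anadrome")
print(y, n, empirical_logit(y, n))
```

The transformed values, together with the ±.5 code for object type, would then be the inputs to a mixed-effects model such as those fit with lme4.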
In considering the random-effects structures of the models, we found that adding by-subject slopes did not significantly improve model fit, while by-item random slopes did. Models with only random intercepts for items showed clearer effects; models that were more sensitive to item-level variation using random slopes did not uniformly show effects. This suggests that our effects generalized robustly across subjects, but not necessarily across items. This was not unexpected, since our design does not provide sufficient power to detect small by-item effects, and the items shared a number of sources of variability. Thus, we report here results both from models with only random intercepts and from models with random slopes (see the supplemental materials, section S2, for complete details of all statistical models). To be clear, both models included random effects for subjects and items. Significance was evaluated using the chi-square goodness-of-fit test comparing models with and without object type as a fixed effect.
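A model comparison of this kind is commonly carried out as a likelihood-ratio test, in which twice the difference in log-likelihood between the two nested models is referred to a chi-square distribution. The sketch below assumes that this is the test meant here and that dropping the single binary fixed effect costs one degree of freedom; the log-likelihood values are hypothetical and are not taken from the study. With one degree of freedom, the chi-square tail probability reduces to a complementary error function, so no statistics library is needed.

```python
import math

def likelihood_ratio_test_df1(loglik_reduced, loglik_full):
    """Compare two nested models that differ by one parameter.

    Returns the chi-square statistic and its p value. For one degree of
    freedom, P(X > x) = erfc(sqrt(x / 2)).
    """
    statistic = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical log-likelihoods for models without and with object type.
statistic, p_value = likelihood_ratio_test_df1(-1520.4, -1514.9)
print(statistic, p_value)
```

A statistic of about 3.84 corresponds to p = .05 under this distribution, which gives a quick check on the function.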
Participants performed well, averaging 99.3 % correct (SD = 1.1 %). We eliminated the few trials on which they selected the incorrect referent (M = 3.9 trials/participant, SD = 6.5). The mean RT for correct trials was 1,217 ms (SD = 642 ms).
Since words that mismatch at onset can also be activated (Allopenna et al., 1998), the anadrome effect could be driven by the overlapping vowel and/or by shared features at onset (e.g., the /k/ in cat shares features with the /t/ in tack, since both are voiceless stops). Such a model also predicts activation for tap (an overlap competitor) when cat is the target. We examined this with a second LME model, which compared looks to overlap and unrelated objects on cohort/overlap trials (Fig. 1, right). The proportions of looks to the two objects were not significantly different in either the random-intercept model (b = 0.001, SE = 0.006, p = .847) or the random-slope model (b = 0.001, SE = 0.006, p = .847). This suggests that overlaps were not activated more than unrelated items.
We next asked whether anadromes also received more looks than overlaps. Here, we considered two approaches to analyzing this difference. One follows the analyses above, which examined looks within a single trial type, holding the auditory stimulus constant and examining fixations to different visual objects in the display. An alternative would be to compare looks to the same visual object when it served as a different type of competitor. For example, we could compare looks to bus when it served as an anadrome on cohort/anadrome trials (with sub as the target) with looks to bus as an overlap on cohort/overlap trials (with sun as the target). This controlled for differences in the visual stimulus that might drive differences in eye movements, gave us twice as many trials from which to draw our data, and might make it easier to detect small competitor effects, since it used trials with a cohort in the display (Dahan et al., 2001).³
Therefore, a stronger test of our hypothesis would be to examine differences in anadrome and overlap looks for pairs whose consonants shared no features (e.g., for bus [target], sub [anadrome], and sun [overlap], the competitors’ initial phoneme differed from the target’s in voicing, place, and manner). For these item sets, activation for the overlap and anadrome would not be enhanced by shared features at onset. If we were to observe more activation for anadromes than for overlaps in these item sets, it would suggest that competition was driven by the presence of phonemes that matched the target but were in the wrong position.
A fourth LME model examined this possibility by comparing anadrome and overlap looks on anadrome/overlap trials for the four item sets that differed in all three types of phonological features (Fig. 2, right). The random-intercept model showed an effect of object type, with more looks to anadromes (b = 0.055, SE = 0.015, p < .001); this effect was not significant in the random-slope model (b = 0.055, SE = 0.040, p = .183). This suggests that our effects cannot be due solely to graded activation of phonemes in slot-based representations of words; phonemes in the incorrect position play an additional role.
We found robust evidence that anadromes are activated more than unrelated words during spoken word recognition. This is consistent with previous work showing activation for lexical competitors that mismatch at onset (e.g., rhymes), and adds to the set of competitors that listeners consider during word recognition.
There are two potential causes of this effect. One possibility is that words are represented in a slot-based way and that a matching phoneme in the second position (the vowel) and/or a mismatching, but similar, phoneme in the first position (the initial consonant) drove the effect. However, we only found mixed evidence that overlaps were activated at all, and we found much stronger evidence that anadromes received more fixations than overlaps when differences in the consonants were maximal. When overlap activation was not heightened by featural match between the overlap and the target, a clear anadrome effect could be seen.
This suggests that words are not represented in a slot-like format, and that phonemes in the wrong position may activate competitors (e.g., the /s/ in the first position of the input could activate bus). Thus, in contrast to existing models of word recognition, anadromes are activated as a consequence of phonemes in the incorrect position.
Given this, why don’t listeners confuse words with their anadromes? One possibility is that fine-grained phonetic detail could serve as a proxy for temporal order. The acoustic forms of consonants vary with syllable positions: A syllable-initial /b/ contains rising formants and a short voice onset time, while a syllable-final /b/ contains falling formants, a closure, and a release burst. While both have similar gross spectral shapes (Blumstein & Stevens, 1979), they differ in many details. As a result, a fine-grained description of sub will differ from that for bus, even if phoneme order is ignored. This can explain why cohorts are more active than other competitors: A word-initial allophone of /b/ is a better match to bus and bun than to sub. We should note, however, that this is only a hypothesis at this point.
The fact that word recognition is sensitive to fine-grained detail (McLennan & Luce, 2005; McMurray et al., 2002), including allophonic detail (Ranbom & Connine, 2007), means that listeners could use subtle acoustic differences to distinguish anadromes. At the same time, the coarse spectral similarity would lead to some parallel activation. Such a model is quite different from most models of spoken word recognition, in that lexical representations have no inherent temporal order. Rather, words would be more like a bundle of fine-grained acoustic cues. This detailed acoustic information is preserved in exemplar models (Goldinger, 1998), although the memory traces for whole words are typically thought to include temporal order.
Although this differs from current models, it could be implemented in similar architectures using a model in which fine-grained cues were mapped directly to words (independently of order). A version of TRACE with a single time alignment and a richer input might show anadrome effects, as could architectures like normalized recurrence (Spivey, 2007), which has been used to model phonetic categorization (McMurray & Spivey, 2000). These types of models could show classic online processing effects, while distinguishing anadrome pairs using fine-grained detail. In the meantime, these results challenge current models and the assumption that words are represented in terms of phoneme order, and they also challenge the field to develop creative alternatives to representing temporal order in models of word recognition.
¹ We use phonemes here only as a convenience to describe the input; we are not implying any representational status for them in the recognition process.
² For trials longer than 1,417 ms, looks were truncated. For trials shorter than 1,417 ms, the final fixation position was used to fill the remaining time period, assuming that listeners had settled on an interpretation when they clicked on the referent. This approach has been used in several previous studies (McMurray et al., 2010; McMurray et al., 2002).
³ For the anadrome/unrelated and overlap/unrelated comparisons, this required us to compare trials with a cohort to those without one. Thus, we initially examined the within-trial-type comparisons for those effects.
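The truncation and fill rule described in the note on the 1,417-ms analysis window can be sketched as below. This is only an illustration of the stated rule under an assumed data layout; the function name and the (start, end, object) tuple format are hypothetical.

```python
def clip_and_fill(looks, window_end_ms=1417):
    """Truncate looks at the window end, or extend the final look to it.

    `looks` is a time-ordered list of (start_ms, end_ms, object_name)
    tuples for a single trial.
    """
    result = []
    for start, end, obj in looks:
        if start >= window_end_ms:
            break  # this look begins after the window closes
        result.append((start, min(end, window_end_ms), obj))
    if result and result[-1][1] < window_end_ms:
        # Short trial: carry the final fixation position to the window end.
        start, _, obj = result[-1]
        result[-1] = (start, window_end_ms, obj)
    return result

short_trial = [(0, 300, None), (300, 900, "target")]
long_trial = [(0, 1000, "anadrome"), (1000, 2000, "target")]
print(clip_and_fill(short_trial))
print(clip_and_fill(long_trial))
```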
We thank Jennifer Cole, Gary Dell, Simon Fischer-Baum, and Deborah Gagnon for insightful comments, and Matt Goldrick and James McQueen for helpful comments on earlier drafts. We also thank Dan McEchron for assistance with the data collection and processing. This research was supported by a Beckman Institute postdoctoral fellowship to J.C.T., NIH Grant DC008089 to B.M., and NIH Grant HD044458 to Gary Dell.