In this experiment, we aimed to test how quickly eye movements reflect the use of auditory and visual information in speech perception. At this juncture it is useful to distinguish the use of the visual speech signal from audiovisual integration. If participants look more toward the picture of a labial referent when seeing a lip-closing speech gesture, this indicates that visual speech is used, but it does not necessarily mean that the visual and the auditory signal are integrated into a multisensory percept.
We presented participants with an audiovisual speech stimulus and asked them to click on the picture referred to by the speaker. By using congruent and incongruent audiovisual stimuli, we could estimate when the visual and the auditory signal were used for speech perception. Because a previous study indicated that initial lexical access might be strongly biased toward the auditory signal (Ostrand et al., 2016), we used slightly ambiguous auditory stimuli in an attempt to give the visual signal more leverage to influence perception. We aimed for stimuli with identification odds of 4 (80/20), which should still give rise to a clear effect of the auditory cues but at the same time allow for a visual influence.
Method
Participants
Fourteen native speakers of German took part in the experiment. They all had normal or corrected-to-normal vision and no hearing impairment. They were paid for their participation.
Materials and procedure
We recorded a German native speaker producing 11 German minimal word pairs, of which 10 were eventually used. The minimal pairs differed in the place of articulation of stop consonants, which were either labial or alveolar (e.g., Panne–Tanne, Engl., (car) breakdown–fir; Table 2 in the Appendix provides the list of minimal pairs used in this experiment). Half of the minimal pairs differed in word-initial position (as in Panne–Tanne), the other half in word-final position (e.g., gelb–Geld, Engl., yellow–money).
Two recordings were made for each of the minimal pairs, one audiovisual recording and one high-quality audio-only recording with the microphone close to the speaker’s mouth (for details on the audio recordings, see below). The video was recorded at 25 frames per second and focused on the speaker’s head (720 × 576 pixels per frame). From this video recording, short clips of 1.2 s were extracted. The audio in these clips started between frames 12 and 13 (i.e., after 480–520 ms). The first 200 ms were overlaid with a fade-in from a black frame to a still of the sixth frame, and the last 200 ms were overlaid with a fade-out from a still of the 25th frame to a black frame. These transitions were added using Adobe Premiere (Adobe Systems Inc.). The videos were then cropped to a size of 350 × 496 pixels using the VirtualDub software (www.virtualdub.org).
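The timing relations in these clips follow directly from the 25-fps frame rate. As a sanity check, the arithmetic can be sketched as follows (1-based frame numbering is our assumption here, not a detail taken from the editing software):

```python
FPS = 25
FRAME_MS = 1000 // FPS         # each frame lasts 40 ms
CLIP_FRAMES = 1200 // FRAME_MS # the 1.2-s clip spans 30 frames
FADE_FRAMES = 200 // FRAME_MS  # each 200-ms fade covers 5 frames

def frame_end_ms(frame):
    """Time (ms) at which a 1-based frame ends."""
    return frame * FRAME_MS

# The audio onset between frames 12 and 13 corresponds to 480-520 ms,
# and the fades leave frames 6 and 25 as the first/last fully visible stills.
audio_window = (frame_end_ms(12), frame_end_ms(13))
```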
Multiple high-quality audio-only recordings were made for each word. The recording that most closely matched the audio from the video was then selected for the generation of audio continua. Discrepancies between the video and the chosen high-quality audio recording were below 30 ms. To select audio tokens that were slightly ambiguous between the labial and alveolar endpoints of the minimal pairs, the selected audio recordings were morphed into 11-step continua using the STRAIGHT audio morphing algorithms (Kawahara, Masuda-Katsuse, & de Cheveigné, 1999).
The continua were pretested by asking nine native speakers of German to categorize six steps of each of the 11 continua (Steps 1, 3, 4, 5, 6, and 8) five times. Based on this pretest, we selected audio files that elicited roughly 20% and 80% labial responses. All pairs gave rise to clear identification functions; we nevertheless excluded the pair Korb–Cord (Engl., basket–cord), because Cord is mostly used in compound words and rarely occurs in isolation in German. For the remaining 10 pairs, the selected audio files were then dubbed onto the videos by replacing the original audio (the original audio was used to time-align the new audio). In this way, we produced videos in which the auditory cues more or less matched the visual cues (e.g., a video of the utterance Tanne with an audio identified as Tanne at 80%) and videos in which the auditory cues mismatched the visual cues (e.g., a video of the utterance Tanne with an audio identified as Tanne at 20%).
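In essence, this selection amounts to a nearest-to-target lookup over each pretest identification function. A minimal sketch (the step numbers and identification rates below are hypothetical, not our actual pretest data):

```python
def pick_steps(ident, targets=(0.2, 0.8)):
    """ident maps a continuum step to the proportion of labial responses
    observed in the pretest; returns the step closest to each target rate."""
    return {t: min(ident, key=lambda s: abs(ident[s] - t)) for t in targets}

# hypothetical identification function for one minimal pair
ident = {1: 0.05, 3: 0.20, 4: 0.45, 5: 0.60, 6: 0.75, 8: 0.95}
chosen = pick_steps(ident)  # step closest to 20% and to 80% labial responses
```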
For each of the 20 words from the 10 minimal pairs, we performed a Google image search and selected an image to represent that word (see Fig. 1 for some examples). Images were scaled to 200 × 200 pixels. Images and videos were combined into displays in two different conditions. The first condition was similar to the typical visual-world paradigm, in which, on each trial, four different pictures were presented on the screen (visual-world condition). The video of the speaker appeared centered on the screen, and the pictures were placed to the right and left of the video. The pictures included a target (e.g., Panne), its competitor (Tanne), and a distractor pair that had the critical sound in a different word position (here, e.g., gelb–Geld, where the critical contrast is in word-final position). The distance between the center of the screen and the middle of the pictures was about 9° horizontally and about 7.5° vertically. The images on the screen changed on every trial (as is usual in the visual-world paradigm), and the randomization was done independently for each participant, with the constraint that the target and its competitor appeared equally often in all four possible positions. The pictures were presented 500 ms prior to the onset of the video, so that, in total, the preview of the pictures lasted 1 s relative to the start of the audio. Participants were instructed to move a visible mouse cursor onto the picture matching the word they heard and click on it.
The second condition was similar to the typical format of audiovisual speech perception studies, with little trial-to-trial variation (minimal-variability condition). Each minimal pair was repeated for 12 trials, and only the two response options for this pair were displayed on the screen. The two pictures were presented in the upper right and upper left positions (see Fig. 1 for the exact positions) and did not switch sides within a set of 12 stimuli. The timing was the same as in the visual-world condition. Participants were asked to press the left mouse button if the utterance matched the picture to the left of the speaker and the right mouse button if it matched the picture to the right of the speaker. Hence, no mouse cursor was visible and no mouse movement was required, following the standard procedure in audiovisual speech experiments. The experimental procedure was implemented using the SR Research Experiment Builder.
Participants were first familiarized with the pictures and their names. Next, they were seated in front of a computer screen, and an SR Research EyeLink 1000 eye tracker in desktop setup was calibrated. They were instructed that they would see a speaker in the center of the screen and two or four pictures scattered over the four quadrants, and would hear a word over headphones. They were asked to decide which word they thought the speaker in the video had uttered. How the response was given differed between the two conditions. In the visual-world condition, participants moved a visible mouse cursor over the picture matching the utterance and clicked on it. In the minimal-variability condition, there was no mouse cursor visible, and participants simply pressed the left mouse button if they thought the utterance matched the picture on the left and the right mouse button if it matched the picture on the right. Condition (visual world vs. minimal variability) was manipulated within participants. It was blocked, with the order counterbalanced across participants. Each condition contained 120 trials.
For each minimal pair, there were four possible audiovisual stimuli, which arose by crossing the two auditory stimuli with the two visual stimuli. Because our main question about the timing of the use of auditory and visual information was addressed via the incongruent trials, these were presented twice as often as the congruent ones. Hence, for each of the 10 minimal pairs, each participant saw the two congruent stimuli (i.e., audio and video both labial or both alveolar) twice (= four trials) and the two incongruent stimuli four times (= eight trials).
In the visual-world condition, the stimuli were presented randomly, with the constraint that the same pair could not be used as a target on two consecutive trials. In the minimal-variability condition, the 12 stimuli for one minimal pair were presented consecutively, and the transition from one to the next minimal pair was indicated by a screen that showed the two pictures for the upcoming minimal pair. It informed participants that they would have to make a choice between these two for the next set of trials. In both conditions, there was a break halfway through the 120 trials; participants continued by pressing a mouse button.
Data processing
The output of the eye tracker in terms of events (saccades, fixations, and blinks) was analyzed with a PERL script to generate a timeline of looks for each trial. Saccades and fixations were merged into looks to the position of the fixation (cf. McMurray et al., 2008). Blinks were replaced with the last preceding fixation position. If there was a fixation position outside the screen on a given trial, the data from this trial were discarded, as such fixation positions often indicate faulty eye tracking. This led to the rejection of 181 trials (4.8%), of which 117 were from one participant who was difficult to calibrate due to wearing glasses. The data from this participant were not used for the eye-tracking analysis. For the remaining participants, the rejection rate was 2.1%.
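The logic of this preprocessing can be sketched as follows (the original analysis used a PERL script; this Python sketch, including the event format, field names, and screen resolution, is illustrative rather than a reproduction of that script):

```python
SCREEN_W, SCREEN_H = 1024, 768  # assumed display resolution

def looks_from_events(events):
    """Collapse an eye-tracker event list into (start, end, (x, y)) look spans.
    Returns None when any fixation falls outside the screen (trial rejected)."""
    looks, last_pos = [], None
    for i, ev in enumerate(events):
        if ev['type'] == 'fix':
            x, y = ev['pos']
            if not (0 <= x < SCREEN_W and 0 <= y < SCREEN_H):
                return None  # off-screen fixation: discard the whole trial
            # a saccade and the fixation it lands in form a single look
            start = (events[i - 1]['start']
                     if i > 0 and events[i - 1]['type'] == 'sacc'
                     else ev['start'])
            looks.append((start, ev['end'], ev['pos']))
            last_pos = ev['pos']
        elif ev['type'] == 'blink' and last_pos is not None:
            # blinks inherit the position of the last preceding fixation
            looks.append((ev['start'], ev['end'], last_pos))
    return looks
```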
For the remaining trials, the fixation positions were categorized as being on the face, on other parts of the video, or on one of the pictures. Fixations were counted as being on a picture only if the fixation position fell on a pixel occupied by that picture. This was necessary because the video and the pictures were quite close to each other. Similarly, fixations were counted as being on the face if they fell within a rectangular region between the eyelashes and the chin in the vertical dimension and between the outer edges of the eye sockets in the horizontal dimension. Previous research has indicated that the use of visual speech cues remains strong when this region is fixated (see Paré et al., 2003, Experiments 1 and 2).
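This region-of-interest assignment is a simple point-in-rectangle test; a minimal sketch (all coordinates and region names below are hypothetical, not the actual screen layout):

```python
def classify_fixation(pos, face_rect, picture_rects):
    """pos: (x, y) in screen pixels. face_rect and each picture rect are
    (left, top, right, bottom) tuples. The face is checked first; a fixation
    on no region at all is classified as 'elsewhere'."""
    x, y = pos
    l, t, r, b = face_rect
    if l <= x <= r and t <= y <= b:
        return 'face'
    for name, (l, t, r, b) in picture_rects.items():
        if l <= x <= r and t <= y <= b:
            return name
    return 'elsewhere'

# hypothetical regions: a face box between eye sockets and chin,
# plus one 200 x 200 picture in the upper left of the display
regions = {'target': (50, 50, 250, 250)}
face = (350, 150, 550, 450)
```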
Results
Overall gaze patterns
First, we provide an overview of where participants looked during the different tasks. This provides a frame of reference for interpreting the behavioral data with regard to how often participants fixated on the face of the speaker versus other parts of the visual display. Gaze patterns are shown separately for minimal pairs differing in the initial phoneme and minimal pairs differing in the final phoneme, because the relative timing of visual and auditory information differs between these two cases. With stop consonants, the visual cues precede the auditory cues in the word-initial case (because the closing gesture precedes the release and leaves no acoustic trace), whereas the visual and auditory cues go hand in hand in the word-final case, in which the visual closing gesture also leads to audible formant transitions. For all eye-tracking data, we used the time of the consonant release as the time anchor, indicated as zero milliseconds in all eye-tracking figures.
Figure 2 shows that participants fixated mostly on the face up to the point of the release of the critical stop consonant. Only in the visual-world condition, and there only for the stimuli with the critical consonant in word-final position, did participants start moving their gaze away from the video toward the pictures as early as the release. That is, at the time of the release, about 50% of fixations were on the video, and this proportion was falling rapidly. However, 50% fixations on the video should be sufficient to expect an influence of the visual signal on the perceptual identifications.
Perceptual identifications
Figure 3 shows the proportion of cases in which the stimuli were identified as labial, that is, responses in which the labial member of the minimal pair was chosen, for all combinations of auditory and visual cues. The left panel shows that in the minimal-variability condition, both auditory and visual cues influenced the likelihood of labial responses. The data for the visual-world condition (right panel), in contrast, indicate that only the auditory cues mattered. This difference was confirmed by a generalized linear mixed-effects model using the package lme4 (V.1.1.10) in R (V.3.2.5). In this analysis, response (0 = alveolar, 1 = labial) was the dependent variable, and the fixed-effect predictors were visual and auditory cue (contrast coded as -0.5 = alveolar, 0.5 = labial), condition (contrast coded as -0.5 = minimal-variability and +0.5 = visual-world), and the two-way interactions of each cue with condition.
We did not specify a full-factorial model, because smaller models require fewer random effects and are consequently less likely to lead to convergence problems. Participant and item (i.e., video file) were entered as random effects, with a maximal random-effects structure (Barr, Levy, Scheepers, & Tily, 2013). The analysis gave rise to a significant effect of auditory cue (b = 4.352, SE = 0.636, z = 6.842, p < .001) that was marginally moderated by condition (b = 1.435, SE = 0.805, z = 1.783, p = .074), and an effect of visual cue (b = 2.392, SE = 0.572, z = 4.184, p < .001) that was strongly moderated by condition (b = -4.667, SE = 0.7458, z = -6.261, p < .001). To further investigate this interaction, we ran separate models for the two conditions, again using a maximal random-effects structure. In the minimal-variability condition, there were effects of auditory cue (b = 3.443, SE = 0.641, z = 5.373, p < .001) and visual cue (b = 4.451, SE = 0.641, z = 6.944, p < .001). In contrast, in the visual-world condition, there was only a significant effect of auditory cue (b = 5.381, SE = 0.822, z = 6.545, p < .001) but not of visual cue (b = 0.067, SE = 0.748, z = 0.090, p = .928).
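The contrast coding of the fixed effects can be illustrated with a minimal sketch (the actual model was fit with lme4 in R; the trial record below and the function name are hypothetical):

```python
# Contrast codes as reported in the text: alveolar/minimal-variability = -0.5,
# labial/visual-world = +0.5.
CODES = {'alveolar': -0.5, 'labial': 0.5,
         'minimal-variability': -0.5, 'visual-world': 0.5}

def code_trial(trial):
    """Turn one trial's factor labels into the numeric predictors of a model of
    the form: response ~ auditory + visual + condition + auditory:condition
    + visual:condition (+ maximal random effects for participant and item)."""
    aud = CODES[trial['auditory']]
    vis = CODES[trial['visual']]
    cond = CODES[trial['condition']]
    return {'auditory': aud, 'visual': vis, 'condition': cond,
            'aud_x_cond': aud * cond, 'vis_x_cond': vis * cond}

coded = code_trial({'auditory': 'labial', 'visual': 'alveolar',
                    'condition': 'visual-world'})
```

With this coding, each fixed-effect coefficient estimates the difference between factor levels at the mean of the other factors, which is what makes the reported main effects interpretable alongside the interactions.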
Given that the null effect of the visual speech cue was unexpected, we investigated it further in two ways. First, we calculated a visual-effect measure for each participant by subtracting the logOdds of labial responses given an alveolar visual speech cue from the logOdds of labial responses given a labial visual speech cue, and ran a Bayesian one-sample t test on these data. For this analysis we used the function ttestBF from the R package BayesFactor (Version 0.9.12) with its default priors. This produces a Bayes factor as its test statistic (Rouder, Speckman, Sun, Morey, & Iverson, 2009), which provides evidence for the null hypothesis if it falls below one third. The Bayes factor for the visual effect in the visual-world condition was 0.211 and hence provides evidence for the hypothesis that the visual cue is not used in the visual-world condition.
Additionally, we used the eye-tracking record to focus on trials in which the face of the speaker was fixated for more than 90% of the 200-ms interval around the closure release. This was the case for about 59% of the trials (894 out of 1,521 trials with valid responses and good eye tracking). An additional 11 trials from one participant were rejected, because this participant looked away from the face on the majority of trials and had no data for one cell of the design. For this subset, the effect of visual speech cue was also small (1%) and not significant (b = 0.111, SE = 1.071, z = 0.104, p > .2). A Bayesian t test on the individual measures of the visual speech cue (defined as above) provided a Bayes factor of 0.327, which is again evidence for the assumption that there is no effect of the visual speech cue, even when the participants focused on the face during the release of the critical consonant.
Time course of auditory- and visual-cues effects
The main rationale of this experiment was to track the time course of the effect of auditory and visual speech cues on fixations on referent pictures in the visual-world condition. The effectiveness of a cue is reflected in more looks to labial versus alveolar referent pictures when audio or video indicate a labial rather than an alveolar consonant. With eye tracking, we can see when these effects start emerging. The relevant data are displayed in Fig. 4 and show a clear difference between lines depending on the auditory cues but no clear difference between lines differing only in visual cues. The effects of the auditory cues arise between 200 and 300 ms after the stop release. This can be seen by comparing the solid versus dashed lines for the word-final condition and the dotted versus dashed-dotted lines for the word-initial condition. Additionally, the figure shows an overall preference for the labial interpretation at the onset of the stop release for the stimuli with word-final stops. For these conditions there are very few looks to the alveolar pictures before about 300 ms after the stop release.
To statistically test the time course of effects in such data, two methods have previously been used. First, following methods used in electrophysiology (see, e.g., van Turennout, Hagoort, & Brown, 1998), moving time windows are used to establish in which time window an effect first reaches significance (Altmann, 2011; Mitterer & Reinisch, 2013; Salverda, Kleinschmidt, & Tanenhaus, 2014). Second, a jackknife method has been used to estimate when an effect reaches a certain percentage of its maximum (see, e.g., McMurray et al., 2008). The latter method has the advantage of being insensitive to effect size and is hence preferable when effects differ in size. However, it cannot be applied to the current data, because it requires an effect with a clearly defined maximum to be present. This is not the case for the visual cues.
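Although the jackknife method was not applicable here, its logic can be sketched briefly: leave each participant out in turn, find when the grand-average effect first reaches a given fraction of its maximum, and derive a jackknife standard error from those leave-one-out onsets. The input format below is an assumption, not the format of our data:

```python
import math

def jackknife_onsets(curves, times, frac=0.5):
    """curves: per-participant effect time courses, each aligned with times.
    Returns the leave-one-out onset estimates (first time the grand average
    reaches frac * its maximum) and the jackknife standard error."""
    n = len(curves)
    onsets = []
    for i in range(n):
        # grand average over all participants except i
        avg = [sum(c[t] for j, c in enumerate(curves) if j != i) / (n - 1)
               for t in range(len(times))]
        thresh = frac * max(avg)
        onsets.append(next(times[t] for t, v in enumerate(avg) if v >= thresh))
    mean = sum(onsets) / n
    # jackknife SE: sqrt((n-1)/n * sum of squared deviations)
    se = math.sqrt((n - 1) / n * sum((o - mean) ** 2 for o in onsets))
    return onsets, se
```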
It is conceivable, however, that there might be at least a transitory effect of the visual cues. To test whether this is the case, we used the moving-window method and ran linear mixed-effects models on a 200-ms time window starting 100 ms before and moving up to 900 ms after the stop release, with the center of the window shifted in steps of 100 ms. For each of these time windows, we calculated the preference for fixating the picture of the referent with the labial consonant by subtracting the logOdds of the fixation proportion for the alveolar picture from the logOdds of the fixation proportion for the labial picture (proportions of zero and one were replaced by 1/(2n) and (2n - 1)/(2n), as recommended by Macmillan and Creelman, 2004). This measure was used as the dependent variable, and the predictors were auditory and visual cue (again contrast coded so that a labial cue was coded as positive).
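The dependent variable for each window can be sketched as follows (a minimal illustration, assuming k of n gaze samples in the window fall on a given picture; function names are ours):

```python
import math

def corrected_log_odds(k, n):
    """Log odds of a fixation proportion k/n, with proportions of 0 and 1
    replaced by 1/(2n) and (2n - 1)/(2n) (cf. Macmillan & Creelman, 2004)."""
    p = k / n
    if p == 0.0:
        p = 1 / (2 * n)
    elif p == 1.0:
        p = (2 * n - 1) / (2 * n)
    return math.log(p / (1 - p))

def window_preference(labial_k, alveolar_k, n):
    """Labial-fixation preference in one 200-ms window of n gaze samples:
    logOdds(labial picture) minus logOdds(alveolar picture)."""
    return corrected_log_odds(labial_k, n) - corrected_log_odds(alveolar_k, n)

# window centers shifted in 100-ms steps over the analyzed region
centers_ms = list(range(-100, 901, 100))
```

The correction keeps the log odds finite for windows in which one picture was never (or always) fixated, which is common with short windows.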
Figure 5 shows the outcome of this analysis. Fixation proportions for the labial picture were significantly influenced by the auditory cue from about 300 ms after the stop release onward. However, there was no discernible effect of the visual cue. As noted above, there was an overall preference for a labial interpretation for the word-final minimal pairs. Because the dependent variable in this time-course analysis was the fixation preference for labial over alveolar pictures, this preference was reflected in a significantly positive intercept. This intercept, indicating an overall preference for labial referents, was significantly larger than zero in the time windows from zero to 500 ms after the stop release (not displayed in the figure to prevent clutter).
Discussion
The aim of this experiment was to test the time course of the use of visual and auditory cues in a visual-world paradigm and to compare these effects with a setting that closely mimics typical experiments on the use of audiovisual information in phonetic categorization (i.e., the minimal-variability condition). The minimal-variability condition served as a check that the videos contained visual cues to the place of articulation of the critical consonants, and it showed that this was the case. When participants saw a labial closure in the video, they were more likely to perceive the corresponding word as containing a labial. Stated more simply, our stimuli gave rise to a McGurk effect.
However, this McGurk effect disappeared in the visual-world condition, in which participants had to click on one of four target pictures that appeared 1 s before the onset of the speech stimulus but changed position from trial to trial. The eye-tracking data showed that this was also the case when participants focused on the speaker during the critical consonant release. It is also important to note that the visual angle between fixation and the center of the screen in our display rarely exceeded 10°, a distance that hardly affects the McGurk effect (Paré et al., 2003).
Hence, it seems that the processing of the visual display, largely independent of eye gaze, interferes with the processing of the visual speech. This extends the literature showing that the McGurk effect, though resilient to variation in visual acuity, is rather vulnerable to an additional visual load (Alsius et al., 2005; Navarra et al., 2010). Our results show that this is the case even when the load-inducing stimuli neither appear concurrently with the visual speech (note that there was a 1-s preview of the pictures before speech onset) nor overlap with it spatially.
It is possible that the use of visual speech cues would have been stronger if the auditory stimuli had been more ambiguous. Note, however, that the McGurk effect (and, indeed, the majority of studies on audiovisual speech processing) has relied on unambiguous auditory stimuli. As such, strong ambiguity of the auditory information does not seem to be necessary for effects of visual speech to occur. By using slightly ambiguous auditory stimuli, we already gave the visual signal more leverage over the final percept than most studies on audiovisual integration in speech perception have done.
There is one oddity in our results to discuss. As pointed out in the “Results” section, participants had an overall preference for a labial interpretation for the word-final minimal pairs, independent of the experimental condition. It is important to note that in these stimuli, pairs such as Kalb–kalt (Engl., calf–cold), there were two temporally separated cues that listeners could use: first the formant transition into the stop closure, which is then followed by the release burst (note that stops in German are canonically released). The auditory stimuli were selected based on a pretest and were identified as labial in 20% or 80% of the cases. In doing so, we apparently selected items in which the formant transition was biased toward a labial interpretation, because, for released stops, the final percept is mostly determined by the burst (see, e.g., Dahan & Tanenhaus, 2004). Therefore, even the stimulus with an 80% alveolar interpretation based on the combination of transition and release had a formant transition that was biased toward a labial interpretation. This may be because the alveolar release burst is typically louder than the labial release burst, so that the morphed mix requires a good deal of the labial release burst to be perceptually ambiguous. This result shows that our eye-tracking data are sensitive enough to reflect the online processing of such fine phonetic detail in a highly time-locked fashion. The fact that even with this measure no effect of visual speech was observed strengthens the argument that visual speech information is not used under a visual load.
While the data clearly show that visual cues are used in the minimal-variability condition but not in the visual-world condition, it is difficult to say what caused this difference. The two conditions differ in many respects, such as the variability of the visual environment, the response format, and so on. We will return to this issue in the General Discussion, because the data of Experiment 2 provide further constraints on when visual speech cues are used.