Automatic audiovisual integration in speech perception
Gentilucci, M. & Cattaneo, L. Exp Brain Res (2005) 167: 66. doi:10.1007/s00221-005-0008-z
Two experiments aimed to determine whether features of both the visual and the acoustical inputs are always merged into the perceived representation of speech, and whether this audiovisual integration is based on cross-modal binding functions or on imitation. In a McGurk paradigm, observers were required to repeat aloud a string of phonemes uttered by an actor (acoustical presentation of a phonemic string) whose mouth, in contrast, mimicked pronunciation of a different string (visual presentation). In a control experiment, participants read the same strings printed as letters. This condition was designed to analyze the voice pattern and the lip kinematics while controlling for imitation. In the control experiment and in the congruent audiovisual presentation, i.e. when the mouth articulation gestures were congruent with the emitted string of phones, the voice spectrum and the lip kinematics varied according to the pronounced string of phonemes. In the McGurk paradigm the participants were unaware of the incongruence between the visual and acoustical stimuli. The acoustical analysis of the participants’ spoken responses showed three distinct patterns: fusion of the two stimuli (the McGurk effect), repetition of the acoustically presented string of phonemes, and, less frequently, repetition of the string of phonemes corresponding to the mouth gestures mimicked by the actor. However, the analysis of the latter two response types showed that formant 2 (F2) of the participants’ voice spectra always differed from the value recorded in the congruent audiovisual presentation: it approached the F2 value of the string of phonemes presented in the other, apparently ignored, modality. The lip kinematics of participants repeating the acoustically presented string of phonemes were influenced by observation of the lip movements mimicked by the actor, but only when pronouncing a labial consonant.
The data are discussed in favor of the hypothesis that features of both the visual and acoustical inputs always contribute to the representation of a string of phonemes and that cross-modal integration occurs by extracting mouth articulation features peculiar for the pronunciation of that string of phonemes.
Keywords: McGurk effect · Audiovisual integration · Voice spectrum analysis · Lip kinematics · Imitation
Most linguistic interactions occur in a face-to-face context, in which both acoustical (speech) and visual (mouth movements) information contribute to message comprehension. Although humans are able to understand words without any visual input, audiovisual perception has been shown to improve language comprehension (Sumby and Pollack 1954), even when the acoustical information is perfectly clear (Reisberg et al. 1987). In support of this behavioral observation, brain-imaging studies have shown that, when the speaker is also seen by the interlocutor, the activation of the acoustical A1/A2 and visual V5/MT cortical areas is greater than when the information is presented in either the acoustical or the visual modality alone (Calvert et al. 2000). In addition, speech-reading activates acoustical areas even in the absence of any acoustical input (Calvert et al. 1997).
Two hypotheses, though not mutually exclusive, can explain the integration of information on verbal messages provided by the two sensory (acoustical and visual) modalities. The first hypothesis is based on specific cross-modal binding functions and postulates supra-modal integration (Calvert et al. 1999, 2000; Calvert and Campbell 2003). This integration could be based on similar patterns of time-varying features common to both the acoustical and the visual input. More specifically, the timing of changes in vocalization is visible as well as audible in terms of their time-varying patterns (Munhall and Vatikiotis-Bateson 1998). For example, variations in speech sound amplitude can be accompanied by visible indicators of changes in the movement pattern of the mouth articulators. Another cross-modal function is based on features of still (configurational) as well as of moving face images (Calvert and Campbell 2003). Anatomically, cortical regions along the superior temporal sulcus (STS) may be involved in specific cross-modal functions. STS is activated by observation of biological motion, including mouth movements during speech (Bonda et al. 1996; Buccino et al. 2004; Calvert et al. 2000; Campbell et al. 2001), and shows consistent and extensive activation also when hearing speech (Calvert et al. 1999, 2000). Calvert et al. (2000) observed that, for appropriately synchronized audiovisual speech, the profile of STS activation correlated with enhanced neuronal activity in the sensory-specific visual (V5/MT) and auditory (A1/A2) cortices. This cross-modal gain may be mediated by back projections from STS to the sensory cortices (Calvert et al. 1999).
The second hypothesis is based on the possibility that presentation of either a human voice pronouncing a string of phones or a face mimicking pronunciation of a string of phonemes activates automatic imitation of the two stimuli. It is possible that the information provided by the two different modalities is integrated by superimposing an imitation mouth program automatically elicited by the visual stimulus on another automatically elicited by the acoustical stimulus, in accordance with the motor theory of speech perception (Liberman and Mattingly 1985). In this respect, cortical regions within Broca’s area may be involved in audiovisual integration by imitation, since this area is activated by observation/imitation of moving and speaking faces (Buccino et al. 2004; Calvert and Campbell 2003; Campbell et al. 2001; Carr et al. 2003; Leslie et al. 2004; for a review see Bookheimer 2002). The activity of Broca’s area is significantly correlated with the increased excitability of the motor system underlying speech production when perceiving auditory speech (Watkins and Paus 2004). This area is also involved in observation/imitation of hand movements (Iacoboni et al. 1999; Buccino et al. 2001, 2004; Heiser et al. 2003), in accordance with the hypothesis that it represents one of the putative sites of the human “mirror system”, which is thought to have evolved from the monkey premotor cortex and to have acquired new cognitive functions such as speech processing (Rizzolatti and Arbib 1998).
The McGurk effect (McGurk and MacDonald 1976) represents a particular kind of audiovisual integration in which the acoustical information on a string of phonemes conflicts with the visually presented mouth articulation gesture. When people process two different syllables, one presented in the visual modality and the other in the acoustical modality, they tend either to fuse or to combine the two elements. For example, when the voice of the talker pronounces the syllable /ba/ and her/his lips mimic the syllable /ga/, the observer tends to fuse the two syllables and to perceive the syllable /da/. Conversely, when the talker’s voice pronounces /ga/ and her/his lips mimic /ba/, the observer tends to combine the two elements and to perceive either /bga/ or /gba/.
The finding that combination rather than fusion of the two strings of phonemes occurs when the visual information on the syllable is unambiguous (/ba/ versus /ga/) suggests that merging the visual with the acoustical information, as observed in the fusion effect, occurs only in particular circumstances, i.e. when the visual stimulus admits multiple interpretations of the string of phonemes (note that the external mouth pattern of /ga/ is not much different from that of /da/). The tendency to fuse auditory and visual speech seems to show some specificity for the language used. Indeed, although it has been well documented for English speakers (for a review see Chen and Massaro 2004; Summerfield 1992; Massaro 1998), some Asian speakers, such as Japanese and Chinese, are less subject to the McGurk effect (Chen and Massaro 2004; Sekiyama and Tohkura 1993). These data pose the following problem: does the process of audiovisual matching code representations lacking features of either the visual or the acoustical stimulus or, in contrast, does it code representations always containing features of both inputs? In the present study we tested the two hypotheses by taking into account, in the McGurk paradigm, the responses in which the participants repeated either the visually or the acoustically presented string of phonemes. Using kinematics analysis and voice spectra analysis, we verified whether the two presentations always influenced the responses. In particular, we verified whether the voice spectra of the repeated string of phonemes changed as compared to the voice spectra of the same string of phonemes repeated in the condition of congruent visual and acoustical stimuli. Moreover, we verified whether they approached the voice spectra of the string of phonemes presented in the other sensory modality.
A second problem is whether audiovisual integration is based on the superimposition of two automatic imitation motor programs or on cross-modal elaboration. The imitation hypothesis postulates that speech perception occurs by automatically integrating the mouth articulation pattern elicited by the acoustical stimulus with that elicited by the visual stimulus (Liberman and Mattingly 1985). The cross-modal hypothesis postulates that perception occurs by supra-modal integration of time-varying characteristics of speech extracted from both the visual and the acoustical stimulus (Calvert et al. 1999, 2000; Calvert and Campbell 2003). To test the two hypotheses we analyzed the responses in which the acoustically presented string of phonemes was repeated, and verified whether its external mouth pattern was influenced by the visual stimulus, i.e. by the external mouth pattern mimicked by the actor. If two automatic imitation motor programs are superimposed, an effect of the visual stimulus on the observer’s external mouth pattern should always be seen. This should occur even when the string of phonemes mimicked by the actor requires specific modification of the internal mouth, so that the external mouth movements are only a consequence of, and indirectly related to, pronunciation of the string of phonemes (in the present study /aga/). On the other hand, if time-varying features specific to the string of phonemes are extracted from the visual stimulus (cross-modal integration hypothesis), we should observe an effect only for the visually presented string of phonemes with labial consonants, i.e. with external mouth modifications peculiar to the pronunciation of that string of phonemes (in the present study /aba/).
Sixty-five right-handed (according to the Edinburgh inventory, Oldfield 1971) Italian speakers (51 females and 14 males, aged 22–27 years) participated in the present study. The study, to which the participants gave written informed consent, was approved by the Ethics Committee of the Medical Faculty of the University of Parma. All participants were naïve as to the McGurk paradigm and, consequently, to the purpose of the study. They were divided into three groups of 8, 31, and 26 individuals. Each group took part in one of three experiments (see below).
Participants sat in front of a table in a soundproof room, with their forearms resting on the table plane. They were required not to move their head and trunk throughout the experimental session. A PC screen placed on the table was 40 cm from the participant’s chest. Two loudspeakers stood at the two sides of the display. The stimuli presented on the PC screen were the following three strings of letters or phonemes: ABA (/aba/), ADA (/ada/) and AGA (/aga/). Note that in Italian the vowel A is always pronounced /a/. In experiment 1 (string-of-letters reading) the strings were printed in white at the centre of the black PC display. Each letter was 3.9 cm high and 2.5 cm wide. The string was presented 1,360 ms after the beginning of the trial and lasted 1,040 ms. In experiments 2 and 3 (audiovisual presentation of strings of phonemes) an actor (face: 6.9×10.4 cm) pronounced the three strings of phonemes. His half-body was presented 2,360 ms after the beginning of the trial and the presentation lasted 2,000 ms. In all the experiments a ready signal, i.e. a red circle and a beep (duration 360 ms), was presented at the beginning of the trial.
The following three experiments were carried out:
Experiment 1. Eight subjects participated in the experiment. The participants were presented with the printed strings of letters. The task was to read the string silently and then to repeat it aloud (string-of-letters reading paradigm).
Experiment 2. Thirty-one subjects participated in the experiment. The actor pronounced one of the three strings of phonemes. In the congruent audiovisual presentation, his visible mouth (visual stimulus) mimicked and his voice (acoustic stimulus) pronounced the same string of phonemes. In the incongruent audiovisual presentation, the visible actor’s mouth mimicked pronunciation of AGA, whereas his voice concurrently pronounced ABA (McGurk paradigm).
Experiment 3. Twenty-six subjects participated in the experiment. The experiment differed from experiment 2 only for the incongruent audiovisual presentation in which the visible actor’s mouth mimicked pronunciation of ABA, whereas his voice simultaneously pronounced AGA (inverse McGurk paradigm).
In all the experiments the participants were required to repeat aloud, at the end of the audio and/or visual stimulus presentation, the perceived string, using a neutral intonation and a voice volume typical of normal conversation. They were not informed that in some trials the visual and acoustical stimuli were incongruent. No constraint on response time was given. At the end of the experimental session, all participants filled in a questionnaire in which they indicated (1) whether during the experimental session the sound of each string of phonemes (i.e. ABA, ADA, and AGA) varied, and (2) whether they noticed that in some trials there was incongruence between the acoustical and the visual stimulus. Each string of letters or phonemes was randomly presented five times. Consequently, experiment 1 consisted of 15 trials, while experiments 2 and 3, which included both congruent and incongruent conditions, consisted of 20 trials each.
The voice emitted by the participants and the actor was recorded by means of a microphone (Studio Electret Microphone, 20–20,000 Hz, 500 Ω, 5 mV/Pa at 1 kHz) placed on a table support. The centre of the support was 8.0 cm from the participant’s chest, to the participant’s right, and 8.0 cm from the participant’s sagittal axis. The microphone was connected to a PC through a sound card (16 PCI Sound Blaster, CREATIVE Technology Ltd., Singapore). The spectrogram of each string of phonemes was computed using the PRAAT software (University of Amsterdam, the Netherlands). The time courses of formants (F) 1 and 2 of the participants and the actor were analyzed. The time course of the string-of-phonemes pronunciation was divided into three parts. The first part (T1-phase) included pronunciation of the first /a/ vowel and the formant transition before mouth occlusion; the latter approximately corresponded to the first mouth closing movement. The second part (T0-phase) included the mouth occlusion. Only the mouth occlusion of ABA pronunciation corresponded to the final lip closure; the mouth occlusion of the other strings corresponded to the final closure of internal mouth parts not recorded by the kinematics techniques. The third part (T2-phase) included the formant transition during release of the mouth occlusion, approximately corresponding to the second mouth opening movement, and pronunciation of the second /a/ vowel. The durations of the participants’ T1-, T0- and T2-phases were measured. Mean values of F1 and F2 of the participants and of the actor during the T1-phase and the T2-phase were calculated. Finally, the mean voice intensity of the participants and the actor during string-of-phonemes pronunciation was measured. Mean F1 of the actor’s voice was 820, 721, and 746 Hz and mean F2 was 1,330, 1,393, and 1,429 Hz when pronouncing ABA, ADA and AGA, respectively. Intensity was on average 54.9 dB.
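The formant measurements above were obtained with PRAAT. As an illustrative sketch only, the core of F1/F2 estimation can be reproduced with linear predictive coding (LPC) in plain NumPy; the sampling rate, LPC order, and synthetic test vowel below are our assumptions for the example, not the study's settings.

```python
import numpy as np

FS = 10_000  # sampling rate (Hz); assumed for this illustration


def lpc(frame, order):
    """LPC coefficients via the autocorrelation method (Levinson-Durbin)."""
    n = len(frame)
    r = np.array([frame[: n - k] @ frame[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / err
        a_prev = a.copy()
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a


def estimate_formants(a, fs=FS, max_bw=400.0, min_f=90.0):
    """Formant frequencies (Hz) from the roots of the LPC polynomial."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0.01]           # one root per conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)
    bws = -(fs / np.pi) * np.log(np.abs(roots))    # pole bandwidths
    keep = (freqs > min_f) & (bws < max_bw)        # drop weak/spurious poles
    return np.sort(freqs[keep])


def make_vowel(formants_hz, bw=60.0, dur=0.3, f0=120, fs=FS):
    """Synthetic vowel: glottal impulse train through an all-pole filter."""
    poles = []
    for f in formants_hz:
        r = np.exp(-np.pi * bw / fs)
        poles += [r * np.exp(2j * np.pi * f / fs), r * np.exp(-2j * np.pi * f / fs)]
    a = np.real(np.poly(poles))        # denominator coefficients, a[0] = 1
    n = int(dur * fs)
    x = np.zeros(n)
    x[:: fs // f0] = 1.0               # impulse train at roughly f0 Hz
    y = np.zeros(n)
    for i in range(n):                 # all-pole recursion y[i] = x[i] - sum a[k] y[i-k]
        s = x[i]
        for k in range(1, min(i, len(a) - 1) + 1):
            s -= a[k] * y[i - k]
        y[i] = s
    return y


# Recover F1/F2 from a synthetic /a/-like vowel (target values are made up)
vowel = make_vowel([800.0, 1400.0])
coeffs = lpc(vowel * np.hamming(len(vowel)), order=4)
print(estimate_formants(coeffs))  # close to [800, 1400] for this synthetic signal
```

For real recordings one would pre-emphasize the signal, analyze short windows along the T1- and T2-phases, and use a higher LPC order (roughly fs/1000 + 2).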
In experiment 1 the statistical analyses on the lip kinematics and the voice spectra of the pronunciation of ABA, ADA, and AGA were carried out in order to detect differences in lip kinematics and voice spectra among the three strings of phonemes. In experiments 2 and 3, the statistical analyses compared the lip kinematics and voice spectra of the strings of phonemes pronounced in the congruent audiovisual presentation with those in the incongruent audiovisual presentation. The aim was to verify whether the string of phonemes in the incongruent condition differed from the corresponding string of phonemes in the congruent condition and, if so, the direction of the change. The experimental design included string of letters or phonemes (ABA, ADA, AGA and, in experiments 2 and 3, the string of phonemes pronounced in the incongruent audiovisual presentation) as a within-subjects factor for maximal lip aperture, lip closure, peak velocity of lip opening, and voice intensity. For F1 and F2 it included string of letters or phonemes and phase (T1 and T2) as within-subjects factors. Finally, for the formant time course it included string of letters or phonemes and phase (T1, T0, and T2); this analysis aimed to detect differences in the duration of vowel (including formant transition) and consonant pronunciation between strings of phonemes pronounced in the congruent and incongruent audiovisual presentations. Separate ANOVAs were carried out on the mean values of the participants’ parameters. The Newman-Keuls post-hoc test was used (significance level set at P<0.05).
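To illustrate the within-subjects design, a minimal one-way repeated-measures ANOVA (string of phonemes as the within-subjects factor, one value per participant and condition) can be written directly in NumPy. The data below are fabricated for the example; the condition means loosely echo the actor's F2 values but are not the study's measurements.

```python
import numpy as np


def rm_anova_oneway(data):
    """One-way repeated-measures ANOVA.

    data: (n_subjects, k_conditions) array, one value per participant per
    level of the within-subjects factor (e.g. mean F2 per string of phonemes).
    Returns the F statistic and its degrees of freedom.
    """
    n, k = data.shape
    grand = data.mean()
    ss_cond = n * ((data.mean(axis=0) - grand) ** 2).sum()  # factor effect
    ss_subj = k * ((data.mean(axis=1) - grand) ** 2).sum()  # subject effect, partialled out
    ss_err = ((data - grand) ** 2).sum() - ss_cond - ss_subj
    df_cond, df_err = k - 1, (k - 1) * (n - 1)
    return (ss_cond / df_cond) / (ss_err / df_err), df_cond, df_err


# Fabricated F2 values (Hz) for 8 participants x 3 strings (ABA, ADA, AGA)
rng = np.random.default_rng(0)
data = (np.array([1330.0, 1393.0, 1429.0])       # per-condition means (made up)
        + rng.normal(0.0, 30.0, size=(8, 1))     # stable per-subject offsets
        + rng.normal(0.0, 10.0, size=(8, 3)))    # trial-to-trial noise
F, df1, df2 = rm_anova_oneway(data)
print(f"F({df1},{df2}) = {F:.1f}")
```

A post-hoc test such as Newman-Keuls would then compare the individual condition pairs, as in the paper.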
Experiment 1: string-of-letters reading paradigm
Experiment 2: McGurk paradigm
At the end of the experiment all participants reported that they had never noticed any incongruence between the visual and the acoustical stimulus. In addition, they reported: “In some trials the same string of phonemes was pronounced differently”. An acoustical analysis of the participants’ spoken responses in the incongruent audiovisual presentation showed that, in most trials, 21 of the 31 participants repeated ABA, eight repeated ADA (the McGurk fusion effect), and two repeated AGA. Figure 2 shows examples of spectrograms in the condition of incongruence between the visual (AGA) and the acoustical (ABA) presentation. The participants repeating either ABA or ADA showed a formant pattern similar to those in the congruent audiovisual presentation and in experiment 1 (Fig. 2).
We performed statistical analyses on voice spectra and lip kinematics of the 21 participants who repeated ABA and of the eight participants who repeated ADA. We compared the voice spectra recorded in the incongruent audiovisual presentation with those recorded in the congruent audiovisual presentation. The analyses showed that F2 significantly increased moving from ABA to AGA (F(3,60)=110.4, P<0.000001, F(3,21)=26.1, P<0.0001, Fig. 3). F2 of the two ABA pronunciations significantly differed from each other, whereas F2 of the two ADA pronunciations did not (Fig. 3). F2 of ABA in the incongruent presentation (‘ABA’ in Fig. 3) increased approaching the F2 value of AGA. In other words, F2 of ABA repetition in the incongruent audiovisual presentation was influenced by the visually presented AGA. F1 decreased moving from ABA to AGA (ABA repetition: F(3,60)=408.6, P<0.00001, 801.1 vs. 794.7 vs. 776.4 Hz; ADA repetition: F(3,21)=4.7, P<0.01, 833.0 vs. 820.1 vs. 815.5 Hz). F1 of the two ABA (801.1 vs. 798.7 Hz) and ADA (820.1 vs. 818.2 Hz) pronunciations did not differ from each other.
The duration of T1 (251.5, 236.8 ms) was longer than the durations of both T0 (116.6, 108.3 ms) and T2 (151.3, 117.0 ms) (F(2,40)=122.1, P<0.000001; F(2,14)=44.7, P<0.00001). No significant difference was found between the durations of the two ABA and the two ADA pronunciations. These results indicate that the difference observed between the F2 values of the two ABA pronunciations did not depend on variation in the duration of T1 and T2. Indeed, a decrease/increase in the duration of T1 and T2 due to shortening/lengthening of the pure-vowel pronunciation could induce a decrease/increase in mean F2 even if the individual F2 values of the formant transition and of the pure vowel did not vary.
Lip closure significantly increased (F(3,60)=50.2, P<0.00001; F(3,21)=36.0, P<0.001, Fig. 4) and peak velocity of lip opening decreased (F(3,60)=157.1, P<0.00001, F(3,21)=69.2, P<0.00001, Fig. 4) moving from ABA to AGA. No significant difference was found between the lip kinematics of the two ABA pronunciations and between the lip kinematics of the two ADA pronunciations.
Experiment 3: inverse McGurk paradigm
The participants’ reports at the end of the experiment were similar to those in experiment 2. An acoustical analysis of the participants’ spoken responses in the incongruent audiovisual presentation showed that, in most trials, 14 of the 26 participants repeated AGA, eight repeated ABA, three repeated ACA (/aka/), and one repeated ABGA (/abga/).
We performed statistical analyses on the 14 participants who repeated AGA and on the eight participants who repeated ABA. In both cases of AGA and ABA repetition F2 significantly increased moving from ABA to AGA (F(3,39)=34.0, P<0.00001, F(3,21)=17.3, P<0.0001, Fig. 3). Most importantly, F2 of the two AGA and ABA pronunciations significantly differed from each other (Fig. 3). F2 of AGA pronounced in the incongruent audiovisual presentation (‘AGA’ in Fig. 3) significantly decreased, whereas F2 of ABA pronounced in the incongruent audiovisual presentation significantly increased (‘ABA’ in Fig. 3) as compared to F2 of the same strings of phonemes pronounced in the congruent audiovisual presentation. Summing up, in the inverse McGurk paradigm, the acoustically presented AGA and the visually presented ABA influenced voice spectra of ABA and AGA repetitions, respectively. In the case of ABA repetition, F1 of ABA (837.1 Hz) was higher than F1 of both ADA (819.3 Hz) and AGA (814.8 Hz, F(3,21)=3.7, P<0.05). However, no effect of the incongruent audiovisual presentation was observed on F1 of ABA (837.1 vs. 834.1 Hz).
The duration T1 (202.4, 207.6 ms) was longer than the duration of T0 (108.0, 107.3 ms) and T2 (112.6, 120.7 ms) (F(2,26)=32.5, P<0.00001; F(2,14)=15.8, P<0.0005). No significant difference was found between the durations of the two AGA and the two ABA pronunciations.
Lip closure significantly increased (F(3,39)=31.9, P<0.00001, F(3,21)=21.6, P<0.00001, Fig. 4) and peak velocity of lip opening decreased (F(3,39)=40.0, P<0.00001, F(3,21)=23.9, P<0.00001, Fig. 4) moving from ABA to AGA. Post-hoc comparisons showed that final lip closure significantly decreased and peak velocity of lip opening significantly increased when AGA was pronounced in the incongruent audiovisual presentation as compared to AGA in the congruent presentation (Fig. 4). In contrast, no significant difference was found between the two ABA repetitions. Summing up, the observation of the lip kinematics of only labials (ABA visual presentation) influenced the lip kinematics of AGA repetition.
The participants in the present study relied more on the acoustical than on the visual information (approximately 70% of responses) when repeating aloud a string of phonemes presented acoustically by an actor whose mouth, in contrast, mimicked pronunciation of another string of phonemes. This acoustical response was even more frequent than the McGurk fusion effect. The McGurk paradigm had never been systematically tested on Italian speakers. It is well known that the Italian phonemic repertoire and the phonetic realization of syllables are simpler than those of other languages, such as English. Consequently, phonemic acoustical identification is simple enough not to require strong reliance on additional visual cues (speech-reading). This hypothesis is in accordance with the results of previous studies comparing the McGurk effect across languages (for a review see Chen and Massaro 2004). Chen and Massaro (2004; see also Massaro 1998; Sekiyama et al. 2003) showed that, when integrating acoustical with visual sources of information on speech, each source is more influential when it is less complex. This behavioral principle was formalized by Massaro (1998) as the fuzzy logical model of perception (FLMP). Using this model, the author proposed that the type of processing of acoustical and visual sources of information is universal across languages, even if the effects of the process can differ.
Although listening to the responses showed that the participants frequently relied on either the acoustical or the visual stimulus alone, more sophisticated analyses, namely the voice spectra and kinematics analyses, showed that these responses were also influenced by the stimulus presented in the other modality. F2 in the voice spectrum of ABA pronounced in the incongruent audiovisual presentation significantly increased as compared to ABA pronounced in the congruent presentation. Control experiment 1 and the congruent audiovisual presentations of experiments 2 and 3 showed that the F2 of AGA is higher than the F2 of ABA. Consequently, in the incongruent audiovisual presentation the participants repeating ABA were likely affected also by the AGA presentation. This was found whether AGA was visually or acoustically presented. Conversely, the F2 of AGA pronounced in the incongruent audiovisual presentation decreased compared to AGA pronounced in the congruent audiovisual presentation, approaching the F2 value of ABA. No mutual influence between the two modalities of presentation was observed for F1. This finding probably depends on the similar pattern of the AGA and ABA formant transitions and, consequently, on the smaller variation in F1 between the two strings of phonemes. In contrast, the pattern of the formant transition differed between the F2 of AGA and that of ABA, and greater variation in F2 was observed between the two strings of phonemes. Thus, it is plausible to suppose that the mutual influence between AGA and ABA resulting from the contrasting audiovisual presentations was more detectable in F2 than in F1. The finding that, for all the strings of phonemes, the variations in F2 were in the direction of the F2 of the string of phonemes presented in the other sensory modality suggests a different perception of the string of phonemes. This was further supported by the participants’ final reports.
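The direction and size of these F2 shifts can be expressed as a single normalized index: the fraction of the distance between the two congruent-condition F2 values that the incongruent-condition F2 has covered. A small sketch (the Hz values are hypothetical, chosen only to mirror the direction of the effect, and are not the study's measurements):

```python
def f2_shift_toward_other(f2_incongruent, f2_congruent_same, f2_congruent_other):
    """Fraction of the distance from the congruent F2 of the repeated string
    to the congruent F2 of the other-modality string that the incongruent F2
    has covered: 0 = no shift, positive values = shift toward the other string.
    """
    return (f2_incongruent - f2_congruent_same) / (f2_congruent_other - f2_congruent_same)


# Hypothetical values: a participant repeats ABA while AGA is seen, and F2
# rises from 1330 Hz (congruent ABA) toward 1429 Hz (congruent AGA).
print(f2_shift_toward_other(1360.0, 1330.0, 1429.0))  # ~0.30: partial shift toward AGA
```

The same index is positive for the inverse case (AGA repeated while ABA is seen, F2 decreasing), so a positive value in either paradigm indicates a shift toward the other-modality string.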
However, the mutual influence between the two strings of phonemes modified the values of F2 but did not reach the threshold required to change the pattern of the formant transition, as occurred in the McGurk fusion effect. In other words, the participants perceived a different sound of the same string of phonemes, rather than perceiving a different string of phonemes. Note that at the end of the experimental session all the participants reported that they were unaware that the actor’s mouth had mimicked pronunciation of a string of phonemes different from the one acoustically pronounced. Taken together, these data support the hypothesis that the representation resulting from automatically matching the acoustical stimulus with the visual stimulus always contains features of both sources of information. However, we have no explanation of why, when the two inputs were integrated, the relative strength of the acoustical or the visual information could change. We may hypothesize that integration was tuned by probabilistic changes in the perception of the stimuli from one instance to the next. Random shifts of attention to either the visual or the acoustical stimulus could contribute to a different stimulus perception, even though the participants were not required to pay greater attention to either of the two stimuli.
The lip kinematics of the actor and of the participants significantly differed when pronouncing ABA and AGA. Consequently, we could detect an influence of the actor’s lip movements on the lip kinematics of the participants repeating the acoustical stimulus (ABA and AGA in experiments 2 and 3, respectively). However, the visually presented ABA influenced the lip kinematics of the AGA repetition, whereas no influence of the visually presented AGA was observed on the lip kinematics of the ABA repetition. This result rules out the hypothesis that the visual and the acoustical inputs were integrated by imitating any visually detected time-varying motor pattern of the external mouth. In contrast, only the perceived time-varying lip motor pattern of a labial consonant, i.e. a consonant requiring characteristic lip movements in order to be correctly pronounced, was effective in inducing changes in the lip kinematics of an observer pronouncing another string of phonemes. These modifications affected F2, which depends also on the volume of the anterior mouth cavity (Ferrero et al. 1979). Lip movements not directly related to consonant pronunciation, such as those of the AGA string of phonemes, did not influence pronunciation of another string of phonemes. The consonant /g/ requires characteristic modification of the internal mouth. Consequently, the observation of the motor pattern of the visible internal mouth during AGA pronunciation could influence the kinematics of the internal mouth and, consequently, the F2 of ABA pronunciation. In addition, it could induce the fusion effect (ADA pronunciation). The fusion effect could be elicited by imitation of the observed lip movements of AGA pronunciation, since the lip kinematics of ADA differed from those of ABA and approached those of AGA. However, it is not parsimonious to suppose that, when pronouncing ABA in experiment 3, the observation of inner mouth movements affected the voice spectra.
In contrast, when pronouncing ADA in experiment 2, the observation (and probably the imitation) of outer mouth movements more strongly influenced the voice spectra and, in particular, F2 (see Fig. 2), which is mainly related to configurations of the inner rather than the outer mouth.
Summing up, only the kinematics peculiar to consonant pronunciation of the presented string of phonemes was extracted from the visual stimulus and integrated with the acoustical stimulus, as shown by the variation in lip kinematics and, consequently, in the voice spectra of the repeated string of phonemes. Extraction of specific information was necessarily related to a different perception of the string of phonemes; if this were not the case, other visual information poorly related to speech would also have been integrated with the sound. These data support the hypothesis of cross-modal integration between the two inputs, rather than superimposition of automatic imitation motor programs of acoustically and visually detected motor patterns. Our data further suggest that cross-modal integration provides graded and continuous information about the speech category. This is in favor of Massaro’s (1998) hypothesis according to which speech is not categorically produced, but reflects the perceptual processing that led to categorization. However, the mouth motor pattern characteristic of a string of phonemes can be extracted, and the patterns not strictly necessary to its pronunciation discarded, only by means of execution of mouth motor programs and detection of the effects of that execution. This is in accordance with the motor theory of speech perception (Liberman and Mattingly 1985). Consequently, we suggest that imitation may be used as a filter stage before cross-modal integration, as supported by the finding that infants use imitation in order to learn speech (Meltzoff 2002). Broca’s area, which is known to be involved in encoding phonological representations in terms of mouth articulation gestures (Demonet et al. 1992; Paulesu et al. 1993; Zatorre et al. 1992), is also activated by imitation of face movements (Carr et al. 2003; Grèzes et al. 2003; Leslie et al. 2004). In addition, it is activated by observation during lip reading and by repetition of perceived auditory speech (Buccino et al. 2004; Watkins and Paus 2004; for a review see Bookheimer 2002). Conversely, STS seems to be mainly involved in the integration between the two modalities of speech presentation (Calvert and Campbell 2003; Calvert et al. 1999, 2000).
We wish to thank Paola Santunione and Andrea Candiani for their help in carrying out the experiments and Dr. Cinzia Di Dio for her comments on the manuscript. The work was supported by a grant from MIUR (Ministero dell’Istruzione, dell’Università e della Ricerca) to M.G.