Abstract
Audiovisual integration (AVI) is well-known during speech perception, but evidence for AVI in speaker identification has been less clear. This chapter reviews evidence for face–voice integration in speaker identification. Links between perceptual representations mediating face and voice identification, tentatively suggested by behavioral evidence more than a decade ago, have been recently supported by neuroimaging data indicating tight functional connectivity between the fusiform face and temporal voice areas. Research that recombined dynamic facial and vocal identities with precise synchrony provided strong evidence for AVI in identifying personally familiar (but not unfamiliar) speakers. Electrophysiological data demonstrate AVI at multiple neuronal levels and suggest that perceiving time-synchronized speaking faces triggers early (∼50–80 ms) audiovisual processing, although audiovisual speaker identity is only computed ∼200 ms later.
Notes
1. It should be noted that, like many other studies, this experiment used static faces. On the one hand, the study is therefore subject to the limitations mentioned earlier; on the other hand, this might be further evidence that even static faces can elicit some crossmodal effects (Joassin, Maurage, Bruyer, Crommelinck, & Campanella, 2004; Joassin et al., 2011).
2. One might speculate that the differences in timing were a consequence of the temporally extended sentence stimuli used in Schweinberger, Kloth, and Robertson (2011) and Schweinberger, Walther, Zäske, and Kovács (2011). However, in as yet unpublished research, we have repeated the same experiment using brief syllabic stimuli similar to those used in the McGurk paradigm, and replicated the crucial results: an early frontocentral negativity around 50–80 ms to bimodal stimuli, and an onset of speaker identity correspondence effects around 250 ms.
References
Andics, A., McQueen, J. M., Petersson, K. M., Gál, V., Rudas, G., & Vidnyánszky, Z. (2010). Neural mechanisms for voice recognition. NeuroImage, 52, 1528–1540.
Belin, P., Bestelmeyer, P. E. G., Latinus, M., & Watson, R. (2011). Understanding voice perception. British Journal of Psychology, 102, 711–725.
Belin, P., Fecteau, S., & Bedard, C. (2004). Thinking the voice: Neural correlates of voice perception. Trends in Cognitive Sciences, 8, 129–135.
Belin, P., & Zatorre, R. J. (2003). Adaptation to speaker’s voice in right anterior temporal lobe. NeuroReport, 14, 2105–2109.
Belin, P., Zatorre, R. J., Lafaille, P., Ahad, P., & Pike, B. (2000). Voice-selective areas in human auditory cortex. Nature, 403, 309–312.
Benson, P. J., & Perrett, D. I. (1991). Perception and recognition of photographic quality facial caricatures: Implications for the recognition of natural images. European Journal of Cognitive Psychology, 3, 105–135.
Bricker, P. D., & Pruzansky, S. (1966). Effects of stimulus content and duration on talker identification. Journal of the Acoustical Society of America, 40, 1441–1449.
Bruce, V., & Young, A. (1986). Understanding face recognition. British Journal of Psychology, 77, 305–327.
Bruce, V., & Young, A. (2011). Face perception. Hove, UK: Psychology Press.
Burton, A. M., Bruce, V., & Johnston, R. A. (1990). Understanding face recognition with an interactive activation model. British Journal of Psychology, 81, 361–380.
Calvert, G. A., Brammer, M. J., & Iversen, S. D. (1998). Crossmodal identification. Trends in Cognitive Sciences, 2, 247–253.
Campanella, S., & Belin, P. (2007). Integrating face and voice in person perception. Trends in Cognitive Sciences, 11, 535–543.
Charest, I., Pernet, C. R., Rousselet, G. A., Quinones, I., Latinus, M., Fillion-Bilodeau, S., et al. (2009). Electrophysiological evidence for an early processing of human voices. BMC Neuroscience, 10(127), 1–11.
Colonius, H., Diederich, A., & Steenken, R. (2009). Time-Window-of-Integration (TWIN) model for saccadic reaction time: Effect of auditory masker level on visual-auditory spatial interaction in elevation. Brain Topography, 21, 177–184.
de Gelder, B., & Vroomen, J. (2000). The perception of emotions by ear and by eye. Cognition & Emotion, 14, 289–311.
Ellis, H. D., Jones, D. M., & Mosdell, N. (1997). Intra- and inter-modal repetition priming of familiar faces and voices. British Journal of Psychology, 88, 143–156.
Formisano, E., De Martino, F., Bonte, M., & Goebel, R. (2008). “Who” Is Saying “What”? Brain-based decoding of human voice and speech. Science, 322, 970–973.
Fox, C. J., & Barton, J. J. S. (2007). What is adapted in face adaptation? The neural representations of expression in the human visual system. Brain Research, 1127, 80–89.
Garrido, L., Eisner, F., McGettigan, C., Stewart, L., Sauter, D., Hanley, J. R., et al. (2009). Developmental phonagnosia: A selective deficit of vocal identity recognition. Neuropsychologia, 47, 123–131.
Ghazanfar, A. A., & Schroeder, C. E. (2006). Is neocortex essentially multisensory? Trends in Cognitive Sciences, 10, 278–285.
Green, K. P., Kuhl, P. K., Meltzoff, A. N., & Stevens, E. B. (1991). Integrating speech information across talkers, gender, and sensory modality: Female faces and male voices in the McGurk effect. Perception & Psychophysics, 50, 524–536.
Hagan, C. C., Woods, W., Johnson, S., Calder, A. J., Green, G. G. R., & Young, A. W. (2009). MEG demonstrates a supra-additive response to facial and vocal emotion in the right superior temporal sulcus. Proceedings of the National Academy of Sciences of the United States of America, 106, 20010–20015.
Hanley, J. R., Smith, S. T., & Hadfield, J. (1998). I recognise you but I can’t place you: An investigation of familiar-only experiences during tests of voice and face recognition. Quarterly Journal of Experimental Psychology, 51A, 179–195.
Haxby, J. V., Hoffman, E. A., & Gobbini, M. I. (2000). The distributed human neural system for face perception. Trends in Cognitive Sciences, 4, 223–233.
Joassin, F., Maurage, P., Bruyer, R., Crommelinck, M., & Campanella, S. (2004). When audition alters vision: An event-related potential study of the cross-modal interactions between faces and voices. Neuroscience Letters, 369, 132–137.
Joassin, F., Pesenti, M., Maurage, P., Verreckt, E., Bruyer, R., & Campanella, S. (2011). Cross-modal interactions between human faces and voices involved in person recognition. Cortex, 47, 367–376.
Kawahara, H., & Matsui, H. (2003). Auditory morphing based on an elastic perceptual distance metric in an interference-free time-frequency representation. IEEE Proceedings of ICASSP, 1, 256–259.
Kovács, G., Zimmer, M., Bankó, É., Harza, I., Antal, A., & Vidnyánszky, Z. (2006). Electrophysiological correlates of visual adaptation to faces and body parts in humans. Cerebral Cortex, 16, 742–753.
Lander, K., & Chuang, L. (2005). Why are moving faces easier to recognize? Visual Cognition, 12, 429–442.
Legge, G. E., Grossmann, C., & Pieper, C. M. (1984). Learning unfamiliar voices. Journal of Experimental Psychology: Learning, Memory, and Cognition, 10, 298–303.
Leopold, D. A., O’Toole, A. J., Vetter, T., & Blanz, V. (2001). Prototype-referenced shape encoding revealed by high-level aftereffects. Nature Neuroscience, 4, 89–94.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748.
Munhall, K. G., Gribble, P., Sacco, L., & Ward, M. (1996). Temporal constraints on the McGurk effect. Perception & Psychophysics, 58, 351–362.
Natu, V., & O’Toole, A. J. (2011). The neural processing of familiar and unfamiliar faces: A review and synopsis. British Journal of Psychology, 102, 726–747.
Navarra, J., Vatakis, A., Zampini, M., Soto-Faraco, S., Humphreys, W., & Spence, C. (2005). Exposure to asynchronous audiovisual speech extends the temporal window for audiovisual integration. Cognitive Brain Research, 25, 499–507.
Neuner, F., & Schweinberger, S. R. (2000). Neuropsychological impairments in the recognition of faces, voices, and personal names. Brain and Cognition, 44, 342–366.
Pollack, I., Pickett, J. M., & Sumby, W. H. (1954). On the identification of speakers by voice. Journal of the Acoustical Society of America, 26, 403–406.
Robertson, D. M. C., & Schweinberger, S. R. (2010). The role of audiovisual asynchrony in person recognition. Quarterly Journal of Experimental Psychology, 63, 23–30.
Saint-Amour, D., De Sanctis, P., Molholm, S., Ritter, W., & Foxe, J. J. (2007). Seeing voices: High-density electrical mapping and source-analysis of the multisensory mismatch negativity evoked during the McGurk illusion. Neuropsychologia, 45, 587–597.
Sams, M., Aulanko, R., Hämäläinen, M., Hari, R., Lounasmaa, O. V., Lu, S.-T., et al. (1991). Seeing speech: Visual information from lip movements modifies activity in the human auditory cortex. Neuroscience Letters, 127, 141–145.
Schweinberger, S. R. (1996). Recognizing people by faces, names, and voices: Psychophysiological and neuropsychological investigations. University of Konstanz: Habilitation Thesis.
Schweinberger, S. R. (2011). Neurophysiological correlates of face recognition. In A. J. Calder, G. Rhodes, M. H. Johnson, & J. V. Haxby (Eds.), The handbook of face perception (pp. 345–366). Oxford: Oxford University Press.
Schweinberger, S. R., Casper, C., Hauthal, N., Kaufmann, J. M., Kawahara, H., Kloth, N., et al. (2008). Auditory adaptation in voice perception. Current Biology, 18, 684–688.
Schweinberger, S. R., Herholz, A., & Sommer, W. (1997). Recognizing famous voices: Influence of stimulus duration and different types of retrieval cues. Journal of Speech, Language, and Hearing Research, 40, 453–463.
Schweinberger, S. R., Herholz, A., & Stief, V. (1997). Auditory long-term memory: Repetition priming of voice recognition. Quarterly Journal of Experimental Psychology, 50A, 498–517.
Schweinberger, S. R., Kloth, N., & Robertson, D. M. C. (2011). Hearing facial identities: Brain correlates of face-voice integration in person identification. Cortex, 47, 1026–1037.
Schweinberger, S. R., Pickering, E. C., Jentzsch, I., Burton, A. M., & Kaufmann, J. M. (2002). Event-related brain potential evidence for a response of inferior temporal cortex to familiar face repetitions. Cognitive Brain Research, 14, 398–409.
Schweinberger, S. R., Robertson, D., & Kaufmann, J. M. (2007). Hearing facial identities. Quarterly Journal of Experimental Psychology, 60, 1446–1456.
Schweinberger, S. R., Walther, C., Zäske, R., & Kovács, G. (2011). Neural correlates of adaptation to voice identity. British Journal of Psychology, 102, 748–764.
Shah, N. J., Marshall, J. C., Zafiris, O., Schwab, A., Zilles, K., Markowitsch, H. J., et al. (2001). The neural correlates of person familiarity. A functional magnetic resonance imaging study with clinical implications. Brain, 124, 804–815.
Sheffert, S. M., & Olson, E. (2004). Audiovisual speech facilitates voice learning. Perception & Psychophysics, 66, 352–362.
Soto-Faraco, S., & Alsius, A. (2009). Deconstructing the McGurk–MacDonald illusion. Journal of Experimental Psychology: Human Perception and Performance, 35, 580–587.
Stein, B. E., & Stanford, T. R. (2008). Multisensory integration: Current issues from the perspective of the single neuron. Nature Reviews Neuroscience, 9, 255–266.
Stekelenburg, J. J., & Vroomen, J. (2007). Neural correlates of multisensory integration of ecologically valid audiovisual events. Journal of Cognitive Neuroscience, 19, 1964–1973.
Sugiura, M., Shah, N. J., Zilles, K., & Fink, G. R. (2005). Cortical representations of personally familiar objects and places: Functional organization of the human posterior cingulate cortex. Journal of Cognitive Neuroscience, 17, 183–198.
Summerfield, Q., MacLeod, A., McGrath, M., & Brooke, M. (1989). Lips, teeth, and the benefits of lipreading. In A. W. Young & H. D. Ellis (Eds.), Handbook of research on face processing (pp. 223–233). Amsterdam: North-Holland.
van Wassenhove, V., Grant, K. W., & Poeppel, D. (2005). Visual speech speeds up the neural processing of auditory speech. Proceedings of the National Academy of Sciences of the United States of America, 102, 1181–1186.
van Wassenhove, V., Grant, K. W., & Poeppel, D. (2007). Temporal window of integration in auditory-visual speech perception. Neuropsychologia, 45, 598–607.
Van Lancker, D., & Kreiman, J. (1987). Voice discrimination and recognition are separate abilities. Neuropsychologia, 25, 829–834.
Van Lancker, D., Kreiman, J., & Wickens, T. D. (1985). Familiar voice recognition: Patterns and parameters. Part II: Recognition of rate-altered voices. Journal of Phonetics, 13, 39–52.
von Kriegstein, K., Kleinschmidt, A., Sterzer, P., & Giraud, A. L. (2005). Interaction of face and voice areas during speaker recognition. Journal of Cognitive Neuroscience, 17, 367–376.
Walker, S., Bruce, V., & O’Malley, C. (1995). Facial identity and facial speech processing: Familiar faces and voices in the McGurk effect. Perception & Psychophysics, 57, 1124–1133.
Welch, R. B., & Warren, D. H. (1980). Immediate perceptual response to intersensory discrepancy. Psychological Bulletin, 88, 638–667.
Zäske, R., Schweinberger, S. R., & Kawahara, H. (2010). Voice aftereffects of adaptation to speaker identity. Hearing Research, 268, 38–45.
Acknowledgments
The author’s research is supported by grants from the Deutsche Forschungsgemeinschaft (Grants Schw 511/6-2 and Schw 511/10-1) in the context of the DFG Research Unit Person Perception (FOR1097). I am very grateful to Romi Zäske for helpful comments on an earlier draft of this chapter.
Copyright information
© 2013 Springer Science+Business Media New York
Cite this chapter
Schweinberger, S.R. (2013). Audiovisual Integration in Speaker Identification. In: Belin, P., Campanella, S., Ethofer, T. (eds) Integrating Face and Voice in Person Perception. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-3585-3_6
Print ISBN: 978-1-4614-3584-6
Online ISBN: 978-1-4614-3585-3
eBook Packages: Biomedical and Life Sciences (R0)