A Comparison of Models for Fusion of the Auditory and Visual Sensors in Speech Perception

Chapter in: Integration of Natural Language and Vision Processing

Abstract

Although a large amount of psychological and physiological evidence of audio-visual integration in speech has been collected over the last 20 years, there is no agreement about the nature of the fusion process. We present the main experimental data and describe the various models proposed in the literature, together with a number of studies in the field of automatic audio-visual speech recognition. We discuss these models in relation to general proposals on intersensory interaction arising from psychology, and on sensor fusion arising from vision and robotics. We then examine the characteristics of four main models in the light of psychological data and formal properties, and present the results of a modelling study on audio-visual recognition of French vowels in noise. We conclude in favor of the relative superiority of a model in which the auditory and visual inputs are projected onto, and fused within, a common representation space related to the motor properties of speech objects, the fused representation being further classified for lexical access.
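To make the favored architecture concrete: each sensor's input is projected into a common, motor-related space, the two projections are fused there, and the fused vector is classified against vowel prototypes. The Python/NumPy sketch below is a minimal illustration of that pipeline only; all dimensions, weights, and names are hypothetical placeholders, and the linear projections and nearest-prototype classifier stand in for whatever mappings the chapter's actual model would learn.

import numpy as np

# All names, dimensions, and weights below are hypothetical placeholders;
# the chapter's model is not specified at this level of detail.
AUDIO_DIM, VISUAL_DIM = 20, 6   # e.g. spectral channels vs. lip-shape parameters
MOTOR_DIM = 3                   # shared, articulation-related space
N_VOWELS = 10                   # vowel categories, as in a French-vowel task

rng = np.random.default_rng(0)

# Stand-in linear projections from each sensor into the shared motor space.
# In a real system these mappings would be learned from data.
W_audio = rng.normal(size=(MOTOR_DIM, AUDIO_DIM))
W_visual = rng.normal(size=(MOTOR_DIM, VISUAL_DIM))

# One placeholder prototype per vowel category in the motor space.
prototypes = rng.normal(size=(N_VOWELS, MOTOR_DIM))

def fuse_and_classify(audio_vec, visual_vec, audio_weight=0.5):
    """Project both inputs into the common space, fuse, then classify.

    Lowering `audio_weight` lets the fusion lean on vision, mirroring the
    growing benefit of lip reading as acoustic noise increases.
    """
    motor_a = W_audio @ audio_vec
    motor_v = W_visual @ visual_vec
    fused = audio_weight * motor_a + (1.0 - audio_weight) * motor_v
    # Nearest-prototype classification in the fused representation space.
    distances = np.linalg.norm(prototypes - fused, axis=1)
    return int(np.argmin(distances))

# Usage: classify one synthetic audio-visual token.
token_audio = rng.normal(size=AUDIO_DIM)
token_visual = rng.normal(size=VISUAL_DIM)
print("Recognized vowel index:", fuse_and_classify(token_audio, token_visual))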

Copyright information

© 1995 Kluwer Academic Publishers

About this chapter

Cite this chapter

Robert-Ribes, J., Schwartz, J. L. & Escudier, P. (1995). A Comparison of Models for Fusion of the Auditory and Visual Sensors in Speech Perception. In Mc Kevitt, P. (ed.) Integration of Natural Language and Vision Processing. Springer, Dordrecht. https://doi.org/10.1007/978-94-009-1639-5_7

  • DOI: https://doi.org/10.1007/978-94-009-1639-5_7

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-0-7923-3944-1

  • Online ISBN: 978-94-009-1639-5
