A Comparison of Models for Fusion of the Auditory and Visual Sensors in Speech Perception

Chapter in: Integration of Natural Language and Vision Processing

Abstract

Although a large amount of psychological and physiological evidence of audio-visual integration in speech has been collected over the last 20 years, there is no agreement about the nature of the fusion process. We present the main experimental data and describe the various models proposed in the literature, together with a number of studies in the field of automatic audio-visual speech recognition. We discuss these models in relation to general proposals on intersensory interaction arising from psychology, and on sensor fusion arising from vision and robotics. We then examine the characteristics of four main models in the light of psychological data and formal properties, and present the results of a modelling study on audio-visual recognition of French vowels in noise. We conclude in favor of the relative superiority of a model in which the auditory and visual inputs are projected onto, and fused within, a common representation space related to the motor properties of speech objects, the fused representation being further classified for lexical access.
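To make the favored architecture concrete: each sensor's input is projected into a common, motor-related space, the two projections are fused there, and the fused vector is classified against vowel prototypes. The Python/NumPy sketch below is a minimal illustration of that pipeline only; all dimensions, weights, and names are hypothetical placeholders, and the linear projections and nearest-prototype classifier stand in for whatever mappings the chapter's actual model would learn.

import numpy as np

# All names, dimensions, and weights below are hypothetical placeholders;
# the chapter's model is not specified at this level of detail.
AUDIO_DIM, VISUAL_DIM = 20, 6   # e.g. spectral channels vs. lip-shape parameters
MOTOR_DIM = 3                   # shared, articulation-related space
N_VOWELS = 10                   # vowel categories, as in a French-vowel task

rng = np.random.default_rng(0)

# Stand-in linear projections from each sensor into the shared motor space.
# In a real system these mappings would be learned from data.
W_audio = rng.normal(size=(MOTOR_DIM, AUDIO_DIM))
W_visual = rng.normal(size=(MOTOR_DIM, VISUAL_DIM))

# One placeholder prototype per vowel category in the motor space.
prototypes = rng.normal(size=(N_VOWELS, MOTOR_DIM))

def fuse_and_classify(audio_vec, visual_vec, audio_weight=0.5):
    """Project both inputs into the common space, fuse, then classify.

    Lowering `audio_weight` lets the fusion lean on vision, mirroring the
    growing benefit of lip reading as acoustic noise increases.
    """
    motor_a = W_audio @ audio_vec
    motor_v = W_visual @ visual_vec
    fused = audio_weight * motor_a + (1.0 - audio_weight) * motor_v
    # Nearest-prototype classification in the fused representation space.
    distances = np.linalg.norm(prototypes - fused, axis=1)
    return int(np.argmin(distances))

# Usage: classify one synthetic audio-visual token.
token_audio = rng.normal(size=AUDIO_DIM)
token_visual = rng.normal(size=VISUAL_DIM)
print("Recognized vowel index:", fuse_and_classify(token_audio, token_visual))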

Copyright information

© 1995 Kluwer Academic Publishers

About this chapter

Cite this chapter

Robert-Ribes, J., Schwartz, J. L. & Escudier, P. (1995). A Comparison of Models for Fusion of the Auditory and Visual Sensors in Speech Perception. In Mc Kevitt, P. (ed.) Integration of Natural Language and Vision Processing. Springer, Dordrecht. https://doi.org/10.1007/978-94-009-1639-5_7

  • DOI: https://doi.org/10.1007/978-94-009-1639-5_7

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-0-7923-3944-1

  • Online ISBN: 978-94-009-1639-5
