Evolving Systems

, Volume 8, Issue 1, pp 85–94 | Cite as

Evolving connectionist method for adaptive audiovisual speech recognition

Original Paper


Reliability is the primary requirement in noisy conditions and for highly variable utterances. Integrating the recognition of visual signals with the recognition of audio signals is indispensable for many applications that require automatic speech recognition (ASR) in harsh conditions. Several important experiments have shown that integrating and adapting to multiple behavioral end-context information during the speech-recognition task significantly improves its success rate. By integrating audio and visual data from speech information, we can improve the performance of an ASR system by differentiating between the most critical cases of phonetic-unit mismatch that occur when processing audio or visual input alone. The evolving fuzzy neural-network (EFuNN) inference method is applied at the decision layer to accomplish this task. This is done through a paradigm that adapts to the environment by changing structure. The EFuNN’s capacity to learn quickly from incoming data and to adapt while on line lowers the ASR system’s complexity and enhances its performance in harsh conditions. Two independent feature extractors were developed, one for speech phonetics (listening to the speech) and the other for speech visemics (lip-reading the spoken input). The EFuNN network was trained to fuse decisions made disjointly by the audio unit and the visual unit. Our experiments have confirmed that the proposed method is reliable for developing a robust, automatic, speech-recognition system.


Audiovisual speech recognition (AVSR) Evolving fuzzy neural network (EFuNN) Speech-to-text (STT) Decision fusion Multimodal speech recognition 


  1. Badura S, Frátrik M, Škvarek O, Klimo M (2014) Bimodal vowel recognition using fuzzy logic networks - naive approach. ELEKTRO, 2014, pp. 22–25, IEEEGoogle Scholar
  2. Basu S, Neti C, Senior A, Rajput N, Subramaniam A, Verma A (1999) Audio-visual large vocabulary continuous speech recognition in the broadcast domain. In: IEEE Workshop on Multimedia Signal Processing. pp. 475–481Google Scholar
  3. Benoît C, Guiard-Marigny T, Le Goff B, Adjoudani A (1996) Which components of the face do humans and machines best speechread? In: Stork DG, Hennecke ME (eds) speechreading by humans and machines: models, systems, and applications. Springer-Verlag, New York, pp 315–328CrossRefGoogle Scholar
  4. Bernstein LE, Auer ET Jr (1996) Word Recognition in Speechreading. In: Stork DG, Hennecke ME (eds) In speechreading by humans and machines: models, systems, and applications. Springer-Verlag, New York, pp 17–26CrossRefGoogle Scholar
  5. Cappelletta L, Harte N (2012) Phoneme-to viseme mapping for visual speech recognition. Proceeding of the 2012 International Conference on Pattern Recognition Applications and Methods (ICPRAM 2012). (2012)Google Scholar
  6. Dupont S, Luettin J (2000) Audio-visual speech modeling for continuous speech recognition. IEEE Trans Multimedia 2(3):141–151CrossRefGoogle Scholar
  7. http://www.kedri.aut.ac.nz. Accessed 6 July 2016
  8. Joo MG (2003) A method of converting conventional fuzzy logic system to 2 layered hierarchical fuzzy system. In: Proceedings of the IEEE International Conference on Fuzzy Systems, pp. 1357–1362Google Scholar
  9. Kasabov N (1998) Evolving fuzzy neural networks—algorithms, applications and biological motivation. In: Yamakawa and Matsumoto (eds), Methodologies for the conception, design and application of soft computing, World Computing, pp. 271–274Google Scholar
  10. Kasabov N (2001) Evolving fuzzy neural networks for online, adaptive, knowledge-based learning. IEEE Trans Syst Man Cybern B 31(6):902–918CrossRefGoogle Scholar
  11. Kasabov N, Postma K, Van den Herik EJ (2000) AVIS: a connectionist-based framework for integrated auditory and visual information processing. Info Sci J 123(1–2):127–148CrossRefMATHGoogle Scholar
  12. Kaucic R, Dalton R, Blake A (1996) Real-time lip tracking for audio-visual speech recognition applications. Proc Eur Conf Comput Vision II:376–387Google Scholar
  13. Kölsch M, Bane R, Höllerer T, Turk M (2006) Touching the visualized invisible: wearable AR with a multimodal interface. IEEE Computer Graphics and Applications, May/June 2006Google Scholar
  14. Malcangi M, de Tintis R (2004) Audio based real-time speech animation of embodied conversational agents. In: Lecture Notes in Computer Science, Vol. 2915, Gesture-based communication in human-computer interaction, pp. 429–430Google Scholar
  15. Malcangi M, Ouazzane K, Patel P (2013) Audio-visual fuzzy fusion for robust speech recognition. In: The 2013 International Joint Conference on Neural Networks (IJCNN), pp. 582–589Google Scholar
  16. Marshall J, Tennent P (2013) “Mobile interaction does not exist” in chi ‘13 extended abstracts on human factors in computing systems (CHI EA ‘13). ACM, New York, pp 2069–2078Google Scholar
  17. Massaro DW (1996) Bimodal Speech Perception: A Progress Report. In: Stork DG, Hennecke ME (eds) In speechreading by humans and machines: models, systems, and applications. Springer-Verlag, New York, pp 79–101CrossRefGoogle Scholar
  18. McGurk H, MacDonald J (1976) Hearing lips and seeing voices. Nature 264:746–748CrossRefGoogle Scholar
  19. Noda K, Yamaguchi Y, Hiroshi K, Okuno G, Ogata T (2015a) Audio-visual speech recognition using deep learning. Appl Intell Springer 42(4):722–737CrossRefGoogle Scholar
  20. Noda K, Yamaguchi Y, Nakadai K, Okuno HG, Ogata T (2015b) An asynchronous DBN for audio-visual speech recognition. Applied Intelligence 42(4):722–737CrossRefGoogle Scholar
  21. Patel P, Ouazzane K, Whitrow R (2005) Automated visual feature extraction for bimodal speech recognition. In: Proceedings of IADAT-micv2005. pp. 118–122Google Scholar
  22. Petajan ED (1984) Automatic lipreading to enhance speech recognition. In: IEEE Communication Society Global Telecommunications ConferenceGoogle Scholar
  23. Salama ES, El-khoribi RA, Shoman ME (2014) Audio-visual speech recognition for people with speech disorders. Int J Comput Appl 96(2):51–56Google Scholar
  24. Sinclair S, Watson C (1995) The development of the Otago speech database. In: Kasabov K, Coghill G (eds). Proceedings of ANNES’95, IEEE Computer Society PressGoogle Scholar
  25. Steifelhagen R, Yang J, Meier U (1997) Real time lip tracking for lipreading. In: Proceedings of EurospeechGoogle Scholar
  26. Stork DG, Wolff GJ, Levine EP (1992) Neural network lipreading system for improved speech recognition. In Proceedings International Joint Conf. on Neural Networks, vol. 2, pp. 289–295Google Scholar
  27. Watts MJ (2009) A decade of Kasabov’s Evolving Connectionist Systems: a Review. IEEE Transactions on Systems, Man and Cybernetics Part C. Applications and Reviews (2009) 39(3):253–269MathSciNetGoogle Scholar
  28. Watts M, Kasabov N (2000) Simple evolving connectionist systems and experiments on isolated phoneme recognition. In: Proceedings of the first IEEE conference on evolutionary computation and neural networks, San Antonio, pp. 232–239, IEEE PressGoogle Scholar
  29. Wright D, Wareham G (2005) Mixing sound and vision: the interaction of auditory and visual information for earwitnesses of a crime scene. Legal Criminol Psychol 10(1):103–108CrossRefGoogle Scholar
  30. Yang J, Waibel A (1996) A real-time face tracker. In: Proc WACV. pp. 142–147Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  1. 1.Department of Computer ScienceUniversità degli Studi di MilanoMilanItaly

Personalised recommendations