Evolving Fuzzy-Neural Method for Multimodal Speech Recognition

  • Mario Malcangi
  • Philip Grew
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 517)


Improving automatic speech-recognition systems is one of the hottest topics in speech-signal processing, especially when such systems must operate in noisy environments. This paper proposes a multimodal, evolutionary, neuro-fuzzy approach to developing an automatic speech-recognition system. The EFuNN paradigm was applied at the decision stage to make inferences about audiovisual information for speech-to-text conversion. Two independent feature extractors were developed: one for speech phonetics (speech listening) and one for speech visemics (lip reading). The EFuNN network was trained to fuse decisions drawn from the audio channel with decisions drawn from the video channel. This soft-computing approach proved robust under harsh conditions and, at the same time, less complex than hard-computing, pattern-matching methods. Preliminary experiments confirm the reliability of the proposed method for developing a robust automatic speech-recognition system.
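The paper's fusion step happens late, at the decision stage: each modality first produces its own per-phoneme confidences, and only those decisions are combined. The sketch below is purely illustrative, not the authors' EFuNN; it shows the general shape of late decision fusion with a hypothetical fixed channel weighting, whereas an EFuNN would learn and evolve its fusion rules from data.

```python
# Illustrative sketch only: a minimal late decision-fusion step for
# audiovisual speech recognition. The paper uses an EFuNN; this toy
# combiner merely shows the idea of merging per-phoneme confidence
# scores from an audio front end and a lip-reading (viseme) front end.
# All names, scores, and weights here are hypothetical.

def fuse_decisions(audio_scores, video_scores, audio_weight=0.7):
    """Combine two per-class confidence dictionaries into one decision.

    audio_scores / video_scores: dicts mapping phoneme label -> [0, 1] score.
    audio_weight: trust placed in the audio channel (video gets the rest).
    Returns the label with the highest fused score.
    """
    video_weight = 1.0 - audio_weight
    labels = set(audio_scores) | set(video_scores)
    fused = {
        label: audio_weight * audio_scores.get(label, 0.0)
               + video_weight * video_scores.get(label, 0.0)
        for label in labels
    }
    return max(fused, key=fused.get)

# In noisy audio, the video channel can tip the decision: /b/ and /p/
# sound alike under noise but differ visibly in lip dynamics.
audio = {"b": 0.45, "p": 0.40}   # acoustically ambiguous
video = {"b": 0.10, "p": 0.90}   # visually clear
print(fuse_decisions(audio, video, audio_weight=0.5))  # -> p
```

The design choice this illustrates is that fusion operates on decisions, not raw features, so each front end can be developed and tuned independently, which is the property the paper's two-extractor architecture relies on.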


Keywords: Audiovisual Speech Recognition (AVSR) · Evolutionary Fuzzy Neural Network (EFuNN) · Speech-to-Text (STT) · Decision fusion · Multimodal speech recognition



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Department of Computer Science, Università degli Studi di Milano, Milan, Italy
