Improving Audio-Visual Speech Recognition Using Gabor Recurrent Neural Networks

  • Ali S. Saudi
  • Mahmoud I. Khalil
  • Hazem M. Abbas
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11377)


The performance of speech recognition systems can be significantly improved when visual information is used in conjunction with the audio signal, especially in noisy environments. Motivated by the success of deep learning in solving Audio-Visual Speech Recognition (AVSR) problems, we propose a deep AVSR model based on a Long Short-Term Memory Bidirectional Recurrent Neural Network (LSTM-BRNN). The proposed model, termed the BRNN\(_{av}\) model, employs Gabor filters in both the audio and visual front-ends under an Early Integration (EI) scheme. The Gabor features simulate the underlying spatiotemporal processing chain that occurs in the Primary Auditory Cortex (PAC) in conjunction with the Primary Visual Cortex (PVC); we refer to them as Gabor Audio Features (GAF) and Gabor Visual Features (GVF), respectively. The experimental results show that the deep Gabor LSTM-BRNN-based model achieves superior performance compared to GMM-HMM-based models that utilize the same front-ends. Furthermore, using GAF and GVF in the audio and visual front-ends attains a significant performance improvement over traditional audio and visual features.
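The abstract rests on two ingredients: Gabor filtering in the front-ends and early integration of the resulting feature streams. As an illustrative sketch only (not the authors' implementation; all function names and parameter choices here are assumptions), a real-valued 2D Gabor kernel and frame-wise audio-visual feature concatenation could look like this:

```python
import math

def gabor_kernel_2d(size, theta, lam, sigma, gamma=0.5):
    """Sample a real 2D Gabor kernel: a Gaussian envelope times a cosine carrier.

    size  -- kernel height/width in samples (odd)
    theta -- orientation of the carrier, in radians
    lam   -- wavelength of the cosine carrier, in samples
    sigma -- standard deviation of the Gaussian envelope
    gamma -- spatial aspect ratio of the envelope
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates into the filter's orientation.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / lam)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

def early_integration(audio_feats, visual_feats):
    """Early Integration (EI): concatenate the per-frame audio and visual
    feature vectors into one joint vector before the recognizer sees them."""
    assert len(audio_feats) == len(visual_feats), "streams must be frame-aligned"
    return [a + v for a, v in zip(audio_feats, visual_feats)]
```

In an AVSR pipeline of this kind, a bank of such kernels (varying `theta`, `lam`, `sigma`) would be convolved with the audio spectrogram and the lip-region images to produce GAF and GVF, which `early_integration` then fuses frame by frame for the LSTM-BRNN.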


Keywords: Audio-Visual Speech Recognition · Bidirectional Recurrent Neural Network · Gabor filters



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Ali S. Saudi¹
  • Mahmoud I. Khalil²
  • Hazem M. Abbas² (corresponding author)
  1. Faculty of Media Engineering and Technology, German University in Cairo, New Cairo, Egypt
  2. Faculty of Engineering, Ain Shams University, Cairo, Egypt
