
Exploration of Properly Combined Audiovisual Representation with the Entropy Measure in Audiovisual Speech Recognition

Published in Circuits, Systems, and Signal Processing

Abstract

Deep belief networks (DBNs) have shown impressive improvements over Gaussian mixture models when employed within hidden Markov model (HMM)-based automatic speech recognition systems. This study investigates the benefits of DBNs for audiovisual speech recognition. First, DBN-HMMs are explored separately in speech recognition and lip-reading tasks. Next, the challenge of appropriately integrating the audio and visual information is studied; to this end, the use of fused features in an audiovisual (AV) DBN-HMM speech recognition task is examined. For the integration, those layers are selected whose representations complement each other, so that generalities and details are jointly provided. A modified technique based on the entropy of the different layers of the DBNs is proposed to measure the amount of information each representation carries. The best audio layer representation is found to be the one with the highest entropy, offering the greatest level of detail in the fusion scheme; in contrast, the best visual layer representation is the one with the lowest entropy, which best provides sufficient generalities. Experiments on an English digit recognition task over the CUAVE database show that the AV DBN-HMM with the proposed feature fusion method reduces the phone error rate by as much as 4% and 1.5%, and the word error rate by about 3.49% and 1.89%, relative to the baseline conventional HMM and the audio-only DBN-HMM, respectively.
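The layer-selection criterion described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the histogram-based entropy estimate, the bin count, and the function names (layer_entropy, select_fusion_layers, fuse) are assumptions introduced here; only the selection rule itself, choosing the highest-entropy audio layer and the lowest-entropy visual layer and then concatenating the two representations, comes from the abstract.

```python
import numpy as np

def layer_entropy(activations, num_bins=64):
    """Estimate the Shannon entropy (in bits) of one layer's activations.

    activations: 2-D array of shape (num_frames, num_units). Pooling all
    values into a single histogram is one plausible reading of the paper's
    entropy measure, not the authors' exact recipe.
    """
    hist, _ = np.histogram(activations.ravel(), bins=num_bins)
    p = hist / hist.sum()
    p = p[p > 0]                      # drop empty bins to avoid log(0)
    return -np.sum(p * np.log2(p))

def select_fusion_layers(audio_layers, visual_layers):
    """Pick the audio layer with the highest entropy (most detail) and the
    visual layer with the lowest entropy (most generality), per the
    selection criterion stated in the abstract. Each argument is a list of
    per-layer activation arrays from the corresponding trained DBN."""
    audio_idx = int(np.argmax([layer_entropy(a) for a in audio_layers]))
    visual_idx = int(np.argmin([layer_entropy(v) for v in visual_layers]))
    return audio_idx, visual_idx

def fuse(audio_feats, visual_feats):
    """Frame-wise concatenation of the two selected representations; the
    result would feed the AV DBN-HMM back end. Assumes the visual stream
    has already been interpolated to the audio frame rate."""
    return np.concatenate([audio_feats, visual_feats], axis=1)
```

In use, one would forward-propagate the development data through each single-modality DBN, collect the activations of every hidden layer, call select_fusion_layers to pick the pair, and train the AV DBN-HMM on the fused features returned by fuse.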


References

  1. I. Almajai, S. Cox, R. Harvey, Y. Lan, Improved speaker independent lipreading using speaker adaptive training and deep neural networks, in Proceedings of 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2016), pp. 2722–2726

  2. E. Avots, T. Sapiński, M. Bachmann, D. Kamińska, Audiovisual emotion recognition in wild. Mach. Vis. Appl. 1–11 (2018). https://doi.org/10.1007/s00138-018-0960-9

  3. T.F. Cootes, C.J. Taylor, D.H. Cooper, J. Graham, Active shape models-their training and application. Comput. Vis. Image Underst. 61, 38–59 (1995)

  4. P. Duchnowski, U. Meier, A. Waibel, See me, hear me: integrating automatic speech recognition and lipreading, in Third International Conference on Spoken Language Processing (1994)

  5. S. Dupont, J. Luettin, Audio-visual speech modeling for continuous speech recognition. IEEE Trans. Multimed. 2(3), 141–151 (2000)

  6. S. Gurbuz, Z. Tufekci, E. Patterson, J.N. Gowdy, Multi-stream product modal audio-visual integration strategy for robust adaptive speech recognition, in Proceedings of 2002 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 2 (IEEE, 2002), pp. II–2021

  7. G. Hinton, L. Deng, D. Yu, G.E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T.N. Sainath, B. Kingsbury, Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Process. Mag. 29(6), 82–97 (2012)

  8. G.E. Hinton, S. Osindero, Y.W. Teh, A fast learning algorithm for deep belief nets. Neural Comput. 18, 1527–1554 (2006)

  9. J. Huang, B. Kingsbury, Audio-visual deep learning for noise robust speech recognition, in Proceedings of 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2013), pp. 7596–7599

  10. D. Kamińska, T. Sapiński, G. Anbarjafari, Efficiency of chosen speech descriptors in relation to emotion recognition. EURASIP J. Audio Speech Music Process. 1, 3 (2017)

  11. M. Kim, J. Ryu, E. Kim, Speech recognition by integrating audio, visual and contextual features based on neural networks, in Advances in Natural Computation, LNCS (Springer, 2005)

  12. T. Lewis, D. Powers, Audio-visual speech recognition using red exclusion and neural networks, vol. 24(1) (Australian Computer Society Inc, Sydney, 2003), pp. 149–156

  13. E. Marcheret, S. Chu, V. Goel, G. Potamianos, Efficient likelihood computation in multi-stream HMM based audio-visual speech recognition, in Eighth International Conference on Spoken Language Processing (ICSLP) (2004)

  14. U. Meier, W. Hurst, P. Duchnowski, Adaptive bimodal sensor fusion for automatic speech reading, in Proceedings of 1996 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 2 (IEEE, 1996), pp. 833–836

  15. A. Mohamed, G. Hinton, G. Penn, Understanding how deep belief networks perform acoustic modelling, in Proceedings of 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2012), pp. 4273–4276

  16. A. Mohamed, G.E. Dahl, G. Hinton, Acoustic modeling using deep belief networks. IEEE Trans. Audio Speech Lang. Process. 20(1), 14–22 (2012)

  17. Y. Mroueh, E. Marcheret, V. Goel, Deep multimodal learning for audio-visual speech recognition, in Proceedings of 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2015), pp. 2130–2134

  18. J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, A.Y. Ng, Multimodal deep learning, in Proceedings of the 28th International Conference on Machine Learning (ICML-11) (2011) pp. 689–696

  19. F. Noroozi, M. Marjanovic, A. Njegus, S. Escalera, G. Anbarjafari, Audio-visual emotion recognition in video clips. IEEE Trans. Affect. Comput. 99, 1–17 (2017)

  20. E.K. Patterson, S. Gurbuz, Z. Tufekci, J.N. Gowdy, CUAVE: a new audio-visual database for multimodal human–computer interface research, in Proceedings of 2002 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2002), pp. II–2017

  21. G. Potamianos, C. Neti, G. Gravier, A. Garg, A.W. Senior, Recent advances in the automatic recognition of audio-visual speech. Proc. IEEE 91(9), 1306–1326 (2003)

  22. D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, The Kaldi speech recognition toolkit, in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, EPFL-CONF-192584 (IEEE Signal Processing Society, 2011)

  23. N. Srivastava, R.R. Salakhutdinov, Multimodal learning with deep Boltzmann machines. J. Mach. Learn. Res. 15, 2949–2980 (2014)

  24. W.H. Sumby, I. Pollack, Visual contribution to speech intelligibility in noise. J. Acoust. Soc. Am. 26, 212–215 (1954)

  25. F. Vakhshiteh, F. Almasganj, A. Nickabadi, Lip-reading via deep neural networks using hybrid visual features. Image Anal. Stereol. 37(2), 159–171 (2018)

  26. F. Vakhshiteh, F. Almasganj, Lip-reading via deep neural network using appearance-based visual features, in 2017 24th National and 2nd International Iranian Conference on Biomedical Engineering (ICBME) (IEEE, 2017), pp. 1–6

  27. A. Verma, T. Faruquie, C. Neti, S. Basu, A. Senior, Late integration in audio-visual continuous speech recognition. Autom. Speech Recognit. Underst. 1, 71–74 (1999)

  28. K. Veselý, A. Ghoshal, L. Burget, D. Povey, Sequence-discriminative training of deep neural networks, in INTERSPEECH (2013) pp. 2345–2349

Author information

Correspondence to Farshad Almasganj.

About this article

Cite this article

Vakhshiteh, F., Almasganj, F. Exploration of Properly Combined Audiovisual Representation with the Entropy Measure in Audiovisual Speech Recognition. Circuits Syst Signal Process 38, 2523–2543 (2019). https://doi.org/10.1007/s00034-018-0975-5
