Abstract
Deep belief networks (DBNs) have shown impressive improvements over Gaussian mixture models when employed within hidden Markov model (HMM)-based automatic speech recognition systems. In this study, the benefits of DBNs for audiovisual speech recognition systems are investigated. First, DBN-HMMs are explored separately on speech recognition and lip-reading tasks. Next, the challenge of appropriately integrating the audio and visual information is studied; to this end, fused features are applied in an audiovisual (AV) DBN-HMM speech recognition task. For the integration, those layers are selected that together provide both generalities and details, so that they complement each other overall. A modified technique based on the entropy of the different layers of the trained DBNs is proposed to measure the amount of information each layer carries. The best audio layer representation is found to have the highest entropy, contributing the most detailed information to the fusion scheme. In contrast, the best visual layer representation is found to have the lowest entropy, best providing sufficient generality. Experiments on the English digit recognition task of the CUAVE database show that the AV DBN-HMM with the proposed feature fusion method reduces the phone error rate by as much as 4% and 1.5%, and the word error rate by about 3.49% and 1.89%, relative to the baseline conventional HMM and the audio-only DBN-HMM, respectively.
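The entropy-based layer selection described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `layer_entropy`, the histogram estimator, the bin count, and the synthetic activations are all hypothetical stand-ins for forward-pass outputs of trained audio and visual DBNs.

```python
import numpy as np

def layer_entropy(activations, n_bins=32):
    """Estimate the Shannon entropy (in bits) of a layer's activation
    distribution from a histogram of the flattened activations."""
    hist, _ = np.histogram(activations.ravel(), bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins before taking logs
    return float(-(p * np.log2(p)).sum())

# Hypothetical per-layer activations (random stand-ins; in practice these
# would be hidden-layer outputs of the trained audio and visual DBNs).
rng = np.random.default_rng(0)
audio_layers = [rng.random((100, d)) for d in (256, 512, 1024)]
visual_layers = [rng.random((100, d)) for d in (128, 256, 512)]

# Selection criterion from the paper: the audio representation with the
# highest entropy (most detail) and the visual representation with the
# lowest entropy (most generality) are fused.
best_audio = max(range(len(audio_layers)),
                 key=lambda i: layer_entropy(audio_layers[i]))
best_visual = min(range(len(visual_layers)),
                  key=lambda i: layer_entropy(visual_layers[i]))
```

In a real pipeline the selected audio and visual layer outputs would then be concatenated into the fused AV feature fed to the DBN-HMM.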
Cite this article
Vakhshiteh, F., Almasganj, F. Exploration of Properly Combined Audiovisual Representation with the Entropy Measure in Audiovisual Speech Recognition. Circuits Syst Signal Process 38, 2523–2543 (2019). https://doi.org/10.1007/s00034-018-0975-5