Abstract
Deep belief networks (DBNs) have shown impressive improvements over Gaussian mixture models when employed within hidden Markov model (HMM)-based automatic speech recognition systems. In this study, the benefits of DBNs for audiovisual speech recognition systems are investigated. First, DBN-HMMs are explored separately on speech recognition and lip-reading tasks. Next, the challenge of appropriately integrating the audio and visual information is studied; to this end, fused features are applied in an audiovisual (AV) DBN-HMM speech recognition task. For the integration, those layers are selected that together provide both generalities and details, so that they complement each other overall. A modified technique based on the entropy of the different layers of the trained DBNs is proposed to measure the amount of information each layer carries. The best audio layer representation is found to have the highest entropy, contributing the most detailed information to the fusion scheme. In contrast, the best visual layer representation is found to have the lowest entropy, best providing sufficient generality. Experiments on the English digit recognition task of the CUAVE database show that the AV DBN-HMM with the proposed feature fusion method reduces the phone error rate by as much as 4% and 1.5%, and the word error rate by about 3.49% and 1.89%, relative to the baseline conventional HMM and the audio-only DBN-HMM, respectively.
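The entropy-based layer selection described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `layer_entropy`, the histogram estimator, the bin count, and the synthetic activations are all hypothetical stand-ins for forward-pass outputs of trained audio and visual DBNs.

```python
import numpy as np

def layer_entropy(activations, n_bins=32):
    """Estimate the Shannon entropy (in bits) of a layer's activation
    distribution from a histogram of the flattened activations."""
    hist, _ = np.histogram(activations.ravel(), bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins before taking logs
    return float(-(p * np.log2(p)).sum())

# Hypothetical per-layer activations (random stand-ins; in practice these
# would be hidden-layer outputs of the trained audio and visual DBNs).
rng = np.random.default_rng(0)
audio_layers = [rng.random((100, d)) for d in (256, 512, 1024)]
visual_layers = [rng.random((100, d)) for d in (128, 256, 512)]

# Selection criterion from the paper: the audio representation with the
# highest entropy (most detail) and the visual representation with the
# lowest entropy (most generality) are fused.
best_audio = max(range(len(audio_layers)),
                 key=lambda i: layer_entropy(audio_layers[i]))
best_visual = min(range(len(visual_layers)),
                  key=lambda i: layer_entropy(visual_layers[i]))
```

In a real pipeline the selected audio and visual layer outputs would then be concatenated into the fused AV feature fed to the DBN-HMM.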
Cite this article
Vakhshiteh, F., Almasganj, F. Exploration of Properly Combined Audiovisual Representation with the Entropy Measure in Audiovisual Speech Recognition. Circuits Syst Signal Process 38, 2523–2543 (2019). https://doi.org/10.1007/s00034-018-0975-5