Combining audio and visual speech recognition using LSTM and deep convolutional neural network

Shashidhar, R.; Patilkulkarni, S.; Puneeth, S. B.

doi:10.1007/s41870-022-00907-y

Combining audio and visual speech recognition using LSTM and deep convolutional neural network

Original Research
Published: 24 February 2022

Volume 14, pages 3425–3436, (2022)
Cite this article

International Journal of Information Technology Aims and scope Submit manuscript

620 Accesses
24 Citations
Explore all metrics

Abstract

Human speech is bimodal, whereas audio speech relates to the speaker's acoustic waveform. Lip motions are referred to as visual speech. Audiovisual Speech Recognition is one of the emerging fields of research, particularly when audio is corrupted by noise. In the proposed AVSR system, a custom dataset was designed for English Language. Mel Frequency Cepstral Coefficients technique was used for audio processing and the Long Short-Term Memory (LSTM) method for visual speech recognition. Finally, integrate the audio and visual into a single platform using a deep neural network. From the result, it was evident that the accuracy was 90% for audio speech recognition, 71% for visual speech recognition, and 91% for audiovisual speech recognition, the result was better than the existing approaches. Ultimately model was skilled at enchanting many suitable decisions while forecasting the spoken word for the dataset that was used.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Audio Visual Speech Recognition Using Deep Recurrent Neural Networks

Improving Audio-Visual Speech Recognition Using Gabor Recurrent Neural Networks

Audio-visual speech recognition using deep learning

Article Open access 20 December 2014

Availability of data and material

The data that support the findings of this study are available from the corresponding author on reasonable request.

Code availability

The code available on reasonable request to the corresponding author.

References

Afouras T, Chung JS, Senior A, Vinyals O, Zisserman A (2018) Deep audio- visual speech recognition. IEEE Trans Pattern Anal Mach Intell. https://doi.org/10.1109/TPAMI.2018.2889052
Article Google Scholar
Ivo I (2011) Speech and language technologies, pp 285–289, https://doi.org/10.5772/938
Shaikh AA, Kumar DK (2011) Visual speech recognition using optical flow and support vector machines. Int J Comput Intell Appl 10:171. https://doi.org/10.1142/S1469026811003045
Article Google Scholar
Stafylakis T, Tzimiropoulos G (2017) Combining residual networks with LSTMs for lip reading https://doi.org/10.21437/Interspeech.2017-85
Shillingford B, Assael YM, Hoffman MW, Paine TL, Hughes C, Prabhu U, Liao H, Sak H, Rao K, Bennett L, Mulville M, Coppin B, Laurie B, Senior AW, Freitas ND (2019) Large-scale visual speech recognition. arXiv: abs/1807.05162
Courtney L, Sreenivas R (2019) Learning from videos with deep convolutional LSTM networks. arXiv preprint. arXiv: 1904.04817.
Sterpu G, Saam C, Harte N (2018) Can DNNs learn to lipread full sentences? In: 25th IEEE International Conference on Image Processing (ICIP), Athens, 2018, pp 16–20, https://doi.org/10.1109/ICIP.2018.8451388
Kumar Y, Jain R, Salik K, Shah RR, Yin Y, Zimmermann R (2019) Lipper: synthesizing thy speech using multi-view lipreading. In: Proceedings of the AAAI Conference on artificial intelligence. 33: 2588–2595, https://doi.org/10.1609/aaai.v33i01.33012588
Xu K, Li D, Cassimatis N, Wang X (2018) LCANet: End-to-end lipreading with cascaded attention-CTC. In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, 2018, pp 548–5558, https://doi.org/10.1109/FG.2018.00088
Margam D, Aralikatti R, Sharma T, Thanda A, Pujitha K, Roy S, Venkatesan S (2019) LipReading with 3D-2D-CNN BLSTM-HMM and word-CTC models. Published in ArXiv 2019. https://dblp.org/db/journals/corr/corr1906.html#abs-1906-12170. Accessed Jan 2021
Stafylakis T, Khan MH, Tzimiropoulos G (2018) Pushing the boundaries of audiovisual word recognition using Residual Networks and LSTMs. Comput Vis Image Underst Vol 176–177:22–32. https://doi.org/10.1016/j.cviu.2018.10.003
Article Google Scholar
Zhang S, Lei M, Ma B, Xie L (2019) Robust audio-visual speech recognition using bimodal Dfsmn with multi-condition training and dropout regularization. In: 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 2019, pp 6570–6574, https://doi.org/10.1109/ICASSP.2019.8682566
Noda K, Yamaguchi Y, Nakadai K, Okuno H, Ogata T (2014) Audio-visual speech recognition using deep learning. Appl Intell 42:722–737. https://doi.org/10.1007/s10489-014-0629-7
Article Google Scholar
Petridis S, Li Z, Pantic M (2017) End-to-end visual speech recognition with LSTMS. In: IEEE International Conference on acoustics, speech and signal processing (ICASSP), New Orleans, LA, 2017, pp 2592–2596, https://doi.org/10.1109/ICASSP.2017.7952625
Tao F, Busso C (2018) Gating neural network for large vocabulary audiovisual speech recognition. IEEE/ACM Trans Audio Speech Lang Process 26(7):1290–1302. https://doi.org/10.1109/TASLP.2018.2815268
Article Google Scholar
Petridis S, Stafylakis T, Ma P, Tzimiropoulos G, Pantic M (2018) Audio-visual speech recognition with a hybrid CTC/attention architecture. In: IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 2018, pp 513–520, https://doi.org/10.1109/SLT.2018.8639643
Goh Y, Lau K, Lee Y (2019) Audio-visual speech recognition system using recurrent neural network. In: 2019 4th International Conference on information technology (InCIT), Bangkok, Thailand, 2019, pp. 38–43, https://doi.org/10.1109/INCIT.2019.8912049
Wang J, Wang L, Zhang J, Wei J, Yu M, Yu R (2018) A large-scale depth-based multimodal audio-visual corpus in mandarin. In: IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), Exeter, United Kingdom, 2018, pp. 881–885, https://doi.org/10.1109/HPCC/SmartCity/DSS.2018.00146
Ochiai T, Delcroix M, Kinoshita K, Ogawa A, Nakatani T (2019) Multimodal SpeakerBeam: single channel target speech extraction with audio-visual speaker clues. In: Interspeech 2019, pp 2718–2722, https://doi.org/10.21437/interspeech.2019-1513
Chung JS, Senior A, Vinyals O, Zisserman A (2017) Lip reading sentences in the wild. In: IEEE Conference on computer vision and pattern recognition (CVPR), Honolulu, HI, 2017, pp 3444–3453, https://doi.org/10.1109/CVPR.2017.367
Jha V, Namboodiri P, Jawahar CV (2018) Word spotting in silent lip videos. In: IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, 2018, pp 150–159, https://doi.org/10.1109/WACV.2018.00023
Thabet Z, Nabih A, Azmi K, Samy Y, Khoriba G, Elshehaly M (2018) Lipreading using a comparative machine learning approach. In: (2018) First International Workshop on Deep and Representation Learning (IWDRL), Cairo, 2018, pp 19–25, https://doi.org/10.1109/IWDRL.2018.8358210
Kumar Y, Aggarwal M, Nawal P, Satoh S, Ratn Shah R, Zimmermann R (2018) Harnessing AI for speech reconstruction using multi-view silent video feed. In: 2018 ACM Multimedia Conference (MM ’18), October 22–26, 2018, Seoul, Republic of Korea. ACM, New York, NY, USA, p 9. https://doi.org/10.1145/3240508.3241911.
Lu Y, Liu Q (2018) (2018) Lip segmentation using automatic selected initial contours based on localized active contour model. Eurasip J Image Video Process 1:2018. https://doi.org/10.1186/s13640-017-0243-9
Article Google Scholar
Matthews I, Cootes TF, Bangham JA, Cox S, Harvey R (2002) Extraction of visual features for lipreading. IEEE Trans Pattern Anal Mach Intell 24:198–213. https://doi.org/10.1109/34.982900
Article Google Scholar
Mesbah A, Hammouchi H, Berrahou A et al (2019) Lip Reading with Hahn convolutional neural networks moments. Image Vis Comput 88:76–83. https://doi.org/10.1016/j.imavis.2019.04.010
Article Google Scholar
Shashidhar R, Patilkulkarni S (2021) Visual speech recognition for small scale dataset using VGG16 convolution neural network. Multimed Tools Appl 15:14. https://doi.org/10.1007/s11042-021-11119-0
Article Google Scholar
Xu X, Xu D, Jia J, Wang Y, Chen B (2021) MFFCN: multi-layer feature fusion convolution network for audio-visual speech enhancement. arXiv: abs/2101.05975
Feng W, Guan N, Li Y, Zhang X, Luo Z (2017) Audio visual speech recognition with multimodal recurrent neural networks. In: 2017 International Joint Conference on neural networks (IJCNN), 2017, pp. 681–688, https://doi.org/10.1109/IJCNN.2017.7965918

Download references

Funding

Not applicable.

Author information

Authors and Affiliations

Department of Electronics and Communication Engineering, JSS Science and Technology University, Mysore, 570006, India
R. Shashidhar & S. Patilkulkarni
Departments of Electronics and Communication Engineering, Presidency University, Bangalore, 560064, India
S. B. Puneeth

Authors

R. Shashidhar
View author publications
You can also search for this author in PubMed Google Scholar
S. Patilkulkarni
View author publications
You can also search for this author in PubMed Google Scholar
S. B. Puneeth
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to R. Shashidhar.

Ethics declarations

Conflicts of interest

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Shashidhar, R., Patilkulkarni, S. & Puneeth, S.B. Combining audio and visual speech recognition using LSTM and deep convolutional neural network. Int. j. inf. tecnol. 14, 3425–3436 (2022). https://doi.org/10.1007/s41870-022-00907-y

Download citation

Received: 17 November 2021
Accepted: 13 February 2022
Published: 24 February 2022
Issue Date: December 2022
DOI: https://doi.org/10.1007/s41870-022-00907-y

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Combining audio and visual speech recognition using LSTM and deep convolutional neural network

Abstract

Access this article

Similar content being viewed by others

Audio Visual Speech Recognition Using Deep Recurrent Neural Networks

Improving Audio-Visual Speech Recognition Using Gabor Recurrent Neural Networks

Audio-visual speech recognition using deep learning

Availability of data and material

Code availability

References

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflicts of interest

Additional information

Publisher’s Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Combining audio and visual speech recognition using LSTM and deep convolutional neural network

Abstract

Access this article

Similar content being viewed by others

Audio Visual Speech Recognition Using Deep Recurrent Neural Networks

Improving Audio-Visual Speech Recognition Using Gabor Recurrent Neural Networks

Audio-visual speech recognition using deep learning

Availability of data and material

Code availability

References

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflicts of interest

Additional information

Publisher’s Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation