Abstract
Human speech is bimodal, whereas audio speech relates to the speaker's acoustic waveform. Lip motions are referred to as visual speech. Audiovisual Speech Recognition is one of the emerging fields of research, particularly when audio is corrupted by noise. In the proposed AVSR system, a custom dataset was designed for English Language. Mel Frequency Cepstral Coefficients technique was used for audio processing and the Long Short-Term Memory (LSTM) method for visual speech recognition. Finally, integrate the audio and visual into a single platform using a deep neural network. From the result, it was evident that the accuracy was 90% for audio speech recognition, 71% for visual speech recognition, and 91% for audiovisual speech recognition, the result was better than the existing approaches. Ultimately model was skilled at enchanting many suitable decisions while forecasting the spoken word for the dataset that was used.
Similar content being viewed by others
Availability of data and material
The data that support the findings of this study are available from the corresponding author on reasonable request.
Code availability
The code available on reasonable request to the corresponding author.
References
Afouras T, Chung JS, Senior A, Vinyals O, Zisserman A (2018) Deep audio- visual speech recognition. IEEE Trans Pattern Anal Mach Intell. https://doi.org/10.1109/TPAMI.2018.2889052
Ivo I (2011) Speech and language technologies, pp 285–289, https://doi.org/10.5772/938
Shaikh AA, Kumar DK (2011) Visual speech recognition using optical flow and support vector machines. Int J Comput Intell Appl 10:171. https://doi.org/10.1142/S1469026811003045
Stafylakis T, Tzimiropoulos G (2017) Combining residual networks with LSTMs for lip reading https://doi.org/10.21437/Interspeech.2017-85
Shillingford B, Assael YM, Hoffman MW, Paine TL, Hughes C, Prabhu U, Liao H, Sak H, Rao K, Bennett L, Mulville M, Coppin B, Laurie B, Senior AW, Freitas ND (2019) Large-scale visual speech recognition. arXiv: abs/1807.05162
Courtney L, Sreenivas R (2019) Learning from videos with deep convolutional LSTM networks. arXiv preprint. arXiv: 1904.04817.
Sterpu G, Saam C, Harte N (2018) Can DNNs learn to lipread full sentences? In: 25th IEEE International Conference on Image Processing (ICIP), Athens, 2018, pp 16–20, https://doi.org/10.1109/ICIP.2018.8451388
Kumar Y, Jain R, Salik K, Shah RR, Yin Y, Zimmermann R (2019) Lipper: synthesizing thy speech using multi-view lipreading. In: Proceedings of the AAAI Conference on artificial intelligence. 33: 2588–2595, https://doi.org/10.1609/aaai.v33i01.33012588
Xu K, Li D, Cassimatis N, Wang X (2018) LCANet: End-to-end lipreading with cascaded attention-CTC. In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, 2018, pp 548–5558, https://doi.org/10.1109/FG.2018.00088
Margam D, Aralikatti R, Sharma T, Thanda A, Pujitha K, Roy S, Venkatesan S (2019) LipReading with 3D-2D-CNN BLSTM-HMM and word-CTC models. Published in ArXiv 2019. https://dblp.org/db/journals/corr/corr1906.html#abs-1906-12170. Accessed Jan 2021
Stafylakis T, Khan MH, Tzimiropoulos G (2018) Pushing the boundaries of audiovisual word recognition using Residual Networks and LSTMs. Comput Vis Image Underst Vol 176–177:22–32. https://doi.org/10.1016/j.cviu.2018.10.003
Zhang S, Lei M, Ma B, Xie L (2019) Robust audio-visual speech recognition using bimodal Dfsmn with multi-condition training and dropout regularization. In: 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 2019, pp 6570–6574, https://doi.org/10.1109/ICASSP.2019.8682566
Noda K, Yamaguchi Y, Nakadai K, Okuno H, Ogata T (2014) Audio-visual speech recognition using deep learning. Appl Intell 42:722–737. https://doi.org/10.1007/s10489-014-0629-7
Petridis S, Li Z, Pantic M (2017) End-to-end visual speech recognition with LSTMS. In: IEEE International Conference on acoustics, speech and signal processing (ICASSP), New Orleans, LA, 2017, pp 2592–2596, https://doi.org/10.1109/ICASSP.2017.7952625
Tao F, Busso C (2018) Gating neural network for large vocabulary audiovisual speech recognition. IEEE/ACM Trans Audio Speech Lang Process 26(7):1290–1302. https://doi.org/10.1109/TASLP.2018.2815268
Petridis S, Stafylakis T, Ma P, Tzimiropoulos G, Pantic M (2018) Audio-visual speech recognition with a hybrid CTC/attention architecture. In: IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 2018, pp 513–520, https://doi.org/10.1109/SLT.2018.8639643
Goh Y, Lau K, Lee Y (2019) Audio-visual speech recognition system using recurrent neural network. In: 2019 4th International Conference on information technology (InCIT), Bangkok, Thailand, 2019, pp. 38–43, https://doi.org/10.1109/INCIT.2019.8912049
Wang J, Wang L, Zhang J, Wei J, Yu M, Yu R (2018) A large-scale depth-based multimodal audio-visual corpus in mandarin. In: IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), Exeter, United Kingdom, 2018, pp. 881–885, https://doi.org/10.1109/HPCC/SmartCity/DSS.2018.00146
Ochiai T, Delcroix M, Kinoshita K, Ogawa A, Nakatani T (2019) Multimodal SpeakerBeam: single channel target speech extraction with audio-visual speaker clues. In: Interspeech 2019, pp 2718–2722, https://doi.org/10.21437/interspeech.2019-1513
Chung JS, Senior A, Vinyals O, Zisserman A (2017) Lip reading sentences in the wild. In: IEEE Conference on computer vision and pattern recognition (CVPR), Honolulu, HI, 2017, pp 3444–3453, https://doi.org/10.1109/CVPR.2017.367
Jha V, Namboodiri P, Jawahar CV (2018) Word spotting in silent lip videos. In: IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, 2018, pp 150–159, https://doi.org/10.1109/WACV.2018.00023
Thabet Z, Nabih A, Azmi K, Samy Y, Khoriba G, Elshehaly M (2018) Lipreading using a comparative machine learning approach. In: (2018) First International Workshop on Deep and Representation Learning (IWDRL), Cairo, 2018, pp 19–25, https://doi.org/10.1109/IWDRL.2018.8358210
Kumar Y, Aggarwal M, Nawal P, Satoh S, Ratn Shah R, Zimmermann R (2018) Harnessing AI for speech reconstruction using multi-view silent video feed. In: 2018 ACM Multimedia Conference (MM ’18), October 22–26, 2018, Seoul, Republic of Korea. ACM, New York, NY, USA, p 9. https://doi.org/10.1145/3240508.3241911.
Lu Y, Liu Q (2018) (2018) Lip segmentation using automatic selected initial contours based on localized active contour model. Eurasip J Image Video Process 1:2018. https://doi.org/10.1186/s13640-017-0243-9
Matthews I, Cootes TF, Bangham JA, Cox S, Harvey R (2002) Extraction of visual features for lipreading. IEEE Trans Pattern Anal Mach Intell 24:198–213. https://doi.org/10.1109/34.982900
Mesbah A, Hammouchi H, Berrahou A et al (2019) Lip Reading with Hahn convolutional neural networks moments. Image Vis Comput 88:76–83. https://doi.org/10.1016/j.imavis.2019.04.010
Shashidhar R, Patilkulkarni S (2021) Visual speech recognition for small scale dataset using VGG16 convolution neural network. Multimed Tools Appl 15:14. https://doi.org/10.1007/s11042-021-11119-0
Xu X, Xu D, Jia J, Wang Y, Chen B (2021) MFFCN: multi-layer feature fusion convolution network for audio-visual speech enhancement. arXiv: abs/2101.05975
Feng W, Guan N, Li Y, Zhang X, Luo Z (2017) Audio visual speech recognition with multimodal recurrent neural networks. In: 2017 International Joint Conference on neural networks (IJCNN), 2017, pp. 681–688, https://doi.org/10.1109/IJCNN.2017.7965918
Funding
Not applicable.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflicts of interest
We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Shashidhar, R., Patilkulkarni, S. & Puneeth, S.B. Combining audio and visual speech recognition using LSTM and deep convolutional neural network. Int. j. inf. tecnol. 14, 3425–3436 (2022). https://doi.org/10.1007/s41870-022-00907-y
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s41870-022-00907-y