Abstract
Speech perception is recognized as a multimodal task; that is, it engages more than one sense. Lip reading, which supplements auditory signals with visual ones, is useful and sometimes even necessary for understanding a message. It is important for a wide range of applications, such as silent dictation, speech recognition in noisy environments, improved hearing aids, and biometrics. It is also a difficult research problem in computer vision, whose main purpose is to observe the movement of a speaker's lips in video and identify the corresponding textual content. However, because lip movements vary only within a limited range while linguistic content is rich, lip recognition is difficult, and this has slowed progress in lip-reading research. The recent success of deep learning in various fields gives us enough confidence to take on the task. Unlike traditional lip recognition, which recognizes lip characteristics, lip reading based on deep learning typically extracts features and interprets images using a network model. In this work, we focus on the design of a network framework for data acquisition, processing, and recognition in lip reading, and we develop an accurate and robust lip-reading algorithm. First, we extract the mouth region and segment the mouth using a proposed hybrid model with a new edge based on a proposed filter. Then we train our spatio-temporal model, which combines Convolutional Neural Networks (CNN) and Bi-directional Gated Recurrent Units (Bi-GRU). Finally, we test our algorithm and obtain an accuracy of 90.38%. This result shows the performance of our system when segmented lip images are used as inputs to the proposed spatio-temporal model.
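As an illustration of the recurrent half of the spatio-temporal model described above, the following is a minimal NumPy sketch of a bi-directional GRU running over per-frame feature vectors. It is not the authors' implementation: the feature vectors stand in for CNN outputs computed from segmented mouth images, and the weights are random and untrained. The helper names (`init_gru`, `gru_step`, `run_gru`, `bi_gru`, `classify_word`), the mean pooling over time, and the dimensions are assumptions made for the example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_gru(input_dim, hidden_dim, rng):
    """Random, untrained weights for one GRU direction (illustrative only)."""
    params = {}
    for gate in ("z", "r", "h"):
        params["W" + gate] = rng.normal(0.0, 0.1, size=(hidden_dim, input_dim))
        params["U" + gate] = rng.normal(0.0, 0.1, size=(hidden_dim, hidden_dim))
        params["b" + gate] = np.zeros(hidden_dim)
    return params


def gru_step(p, x, h):
    """One GRU update for a single frame feature x and previous state h."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])  # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])  # reset gate
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h) + p["bh"])
    # One common convention; some texts swap the roles of z and (1 - z).
    return (1.0 - z) * h + z * h_cand


def run_gru(p, frames, hidden_dim):
    """Run a GRU over frames in the given order, returning every state."""
    h = np.zeros(hidden_dim)
    states = []
    for x in frames:
        h = gru_step(p, x, h)
        states.append(h)
    return np.stack(states)


def bi_gru(frames, fwd, bwd, hidden_dim):
    """Concatenate forward-in-time and backward-in-time states per frame."""
    forward = run_gru(fwd, frames, hidden_dim)
    # Process the reversed sequence, then flip back so rows align with frames.
    backward = run_gru(bwd, frames[::-1], hidden_dim)[::-1]
    return np.concatenate([forward, backward], axis=1)


def classify_word(frames, fwd, bwd, w_out, hidden_dim):
    """Average the Bi-GRU states over time and apply a softmax over words."""
    pooled = bi_gru(frames, fwd, bwd, hidden_dim).mean(axis=0)
    logits = w_out @ pooled
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()
```

For a clip of `T` frames with `D`-dimensional features, `bi_gru` returns an array of shape `(T, 2 * hidden_dim)`, so every frame's representation carries context from both earlier and later lip movements. In a trained system the convolutional layers and the recurrent weights would be learned jointly from labelled word clips.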
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Miled, M., Messaoud, M.A.B. & Bouzid, A. Lip reading of words with lip segmentation and deep learning. Multimed Tools Appl 82, 551–571 (2023). https://doi.org/10.1007/s11042-022-13321-0
DOI: https://doi.org/10.1007/s11042-022-13321-0