Abstract
Lipreading refers to translating the lip motion of a speaker in a video into the corresponding text. Existing lipreading methods typically describe lip motion through visual appearance variations. However, relying on visual variations alone is prone to producing inaccurate text, since different words can share similar lip shapes. Moreover, visual features generalize poorly to unseen speakers, especially when the training data is limited. In this paper, we leverage both lip visual motion and facial landmarks, and propose an effective sentence-level end-to-end approach for lipreading. The facial landmarks are introduced to suppress irrelevant visual features that are sensitive to the specific lip appearance of individual speakers, enabling the model to adapt to different lip shapes and to generalize to unseen speakers. Specifically, the proposed framework consists of two branches corresponding to the visual features and the facial landmarks. The visual branch extracts high-level visual features from the lip movement, while the landmark branch learns both the spatial and the temporal patterns described by the landmarks. The feature embeddings from the two streams are fused for each frame to form its latent vector, and a sequence-to-sequence model takes the per-frame embeddings of all frames as input and decodes them into text. The proposed method is demonstrated to generalize well to unseen speakers on benchmark datasets.
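To make the two-stream design concrete, the following is a minimal sketch of the per-frame fusion step described above. It assumes concatenation as the fusion operator and uses illustrative dimensions (75 frames, a 512-d visual embedding, and 68 two-dimensional landmarks); none of these specifics come from the paper, and the branch outputs are stubbed with random arrays in place of the actual CNN and landmark encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not taken from the paper):
# T frames; a 512-d visual embedding; 68 landmarks x (x, y) = 136-d.
T, D_VIS, D_LMK = 75, 512, 136

# Stand-ins for the two branches' per-frame outputs.
visual_feats = rng.standard_normal((T, D_VIS))    # visual branch (e.g. a CNN on lip crops)
landmark_feats = rng.standard_normal((T, D_LMK))  # landmark branch (spatio-temporal encoder)

# Per-frame fusion: join the two embeddings into one latent vector per frame.
# The resulting (T, D_VIS + D_LMK) sequence is what a sequence-to-sequence
# decoder would consume to produce the output text.
fused = np.concatenate([visual_feats, landmark_feats], axis=1)
print(fused.shape)  # (75, 648)
```

The sketch only shows the fusion interface between the branches and the decoder; the actual feature extractors and the sequence-to-sequence model are described in the body of the paper.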
Data availability
We conducted experiments on publicly available datasets CMLR and GRID. Below are the official links for these two datasets:
- Official link for the CMLR dataset: https://www.vipazoo.cn/CMLR.html
- Official link for the GRID dataset: https://spandh.dcs.shef.ac.uk/gridcorpus/.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant 62272143, in part by the University Synergy Innovation Program of Anhui Province under Grant GXXT-2022-054, in part by the Anhui Provincial Major Science and Technology Project under Grant 202203a05020025, and in part by the Seventh Special Support Plan for Innovation and Entrepreneurship in Anhui Province.
Author information
Authors and Affiliations
Contributions
YL primarily conducted experimental validation, prepared figures, contributed to writing the original draft, and participated in review and editing. FX, LW and SL primarily focused on review and editing. YX conducted experimental validation and prepared figures.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Communicated by H. Li.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, Y., Xue, F., Wu, L. et al. Generalizing sentence-level lipreading to unseen speakers: a two-stream end-to-end approach. Multimedia Systems 30, 42 (2024). https://doi.org/10.1007/s00530-023-01226-3