Lip Reading Using Deformable 3D Convolution and Channel-Temporal Attention

Peng, Chen; Li, Jun; Chai, Jie; Zhao, Zhongqiu; Zhang, Housen; Tian, Weidong

doi:10.1007/978-3-031-15937-4_59

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13532)

Included in the following conference series:

International Conference on Artificial Neural Networks

Abstract

For lip reading of isolated words, most current front-end networks combine a 3D convolutional layer with a 2D convolutional network to extract features, and most back-end networks use a temporal processing network for classification. However, the front-end convolution does not follow the lip structure when extracting spatial information, and the back-end cannot exploit all correlations among global spatio-temporal features. In this paper, we therefore propose a network with deformable 3D convolution (D3D) and channel-temporal attention (CT). D3D adaptively adjusts its sampling positions according to the lip structure, making more efficient use of spatial information, while CT exploits the intrinsic correlations among features so that the network concentrates on valuable key frames. Experiments demonstrate the effectiveness of the proposed method in information extraction and show that our network achieves state-of-the-art performance for lip reading.
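
The two ideas named in the abstract can be illustrated with a toy sketch. This is not the authors' D3D or CT module, whose details the abstract does not give. It assumes a 1D signal instead of a 3D video volume, offsets supplied by hand rather than predicted by a network, and a parameter-free sigmoid gate in place of learned attention. All function names (sample_linear, deformable_conv1d, channel_temporal_gate) are hypothetical.

```python
import math


def sample_linear(signal, pos):
    """Linearly interpolate `signal` at fractional index `pos` (zero outside)."""
    lo = math.floor(pos)
    frac = pos - lo

    def at(i):
        return signal[i] if 0 <= i < len(signal) else 0.0

    return (1.0 - frac) * at(lo) + frac * at(lo + 1)


def deformable_conv1d(signal, weights, offsets):
    """Toy 1D deformable convolution.

    Tap k of output i reads position i + k - r + offsets[i][k], where r is
    the kernel radius. With all offsets zero this is an ordinary
    zero-padded correlation.
    """
    r = len(weights) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w * sample_linear(signal, i + k - r + offsets[i][k])
        out.append(acc)
    return out


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_temporal_gate(feats):
    """Toy channel-temporal reweighting of feats[c][t].

    Assumption: weights are the sigmoid of the per-channel mean (over time)
    and of the per-frame mean (over channels). A learned module would
    replace these fixed statistics with trainable layers.
    """
    n_ch, n_t = len(feats), len(feats[0])
    ch_w = [sigmoid(sum(row) / n_t) for row in feats]
    t_w = [sigmoid(sum(feats[c][t] for c in range(n_ch)) / n_ch)
           for t in range(n_t)]
    return [[feats[c][t] * ch_w[c] * t_w[t] for t in range(n_t)]
            for c in range(n_ch)]


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0]
    identity = [0.0, 1.0, 0.0]
    shifted = [[0.0, 0.5, 0.0] for _ in signal]  # centre tap moved by half a step
    print(deformable_conv1d(signal, identity, shifted))
    print(channel_temporal_gate([[1.0, -1.0], [2.0, 0.0]]))
```

In this sketch the offsets move each kernel tap off the regular grid, which is the sense in which a deformable convolution samples adaptively, and the gate scales frames with larger average activation more strongly, a crude stand-in for concentrating on key frames.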

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant 61976079, in part by the Guangxi Key Research and Development Program under Grant 2021AB20147, and in part by the Anhui Key Research and Development Program under Grant 202004a05020039.

Author information

Authors and Affiliations

School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, 230009, China
Chen Peng, Zhongqiu Zhao, Housen Zhang & Weidong Tian
Fiber Inspection Bureau of Anhui Province, Chunjiang, China
Chen Peng, Jun Li, Jie Chai, Zhongqiu Zhao, Housen Zhang & Weidong Tian
Intelligent Manufacturing Institute of HFUT, Hefei, China
Chen Peng, Zhongqiu Zhao, Housen Zhang & Weidong Tian
Guangxi Academy of Sciences, Guangxi, China
Chen Peng, Zhongqiu Zhao, Housen Zhang & Weidong Tian

Authors

Chen Peng
Jun Li
Jie Chai
Zhongqiu Zhao
Housen Zhang
Weidong Tian

Corresponding author

Correspondence to Zhongqiu Zhao.

Editor information

Editors and Affiliations

University of the West of England, Bristol, UK
Elias Pimenidis
Lancaster University, Lancaster, UK
Plamen Angelov
Digital Innovation, Teesside University, Middlesbrough, UK
Chrisina Jayne
Democritus University of Thrace, Xanthi, Greece
Antonios Papaleonidas
The University of the West of England, Bristol, UK
Mehmet Aydin

About this paper

Cite this paper

Peng, C., Li, J., Chai, J., Zhao, Z., Zhang, H., Tian, W. (2022). Lip Reading Using Deformable 3D Convolution and Channel-Temporal Attention. In: Pimenidis, E., Angelov, P., Jayne, C., Papaleonidas, A., Aydin, M. (eds) Artificial Neural Networks and Machine Learning – ICANN 2022. ICANN 2022. Lecture Notes in Computer Science, vol 13532. Springer, Cham. https://doi.org/10.1007/978-3-031-15937-4_59

DOI: https://doi.org/10.1007/978-3-031-15937-4_59
Published: 07 September 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-15936-7
Online ISBN: 978-3-031-15937-4
eBook Packages: Computer Science, Computer Science (R0)
