Abstract
Lip movement can be used as an alternative approach to biometric authentication. We describe a novel method for lip-password authentication using end-to-end 3D convolution and bidirectional long short-term memory. By employing triplet loss to train the deep neural networks on lip motion, the representation of each class becomes more compact and isolated, reducing classification error in one-shot learning of new users with our baseline approach. We further introduce a hybrid model that combines features from two different models: a lip-reading model that learns what phrase is uttered by the speaker, and a speaker-authentication model that learns the identity of the speaker. On a publicly available dataset, AV Digits, our hybrid model achieves a 9.0% equal error rate, improving on the 15.5% of the baseline approach.
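To make the two key quantities in the abstract concrete, here is a minimal NumPy sketch of (a) the triplet loss used to train the embedding network and (b) the equal error rate (EER) used for evaluation. This is an illustration of the standard definitions, not the paper's implementation; the function names, the margin value, and the use of squared Euclidean distance are assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss on embedding vectors: pull the anchor
    toward the positive (same class) and push it away from the
    negative (different class) by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)  # squared distance to same-class sample
    d_neg = np.sum((anchor - negative) ** 2)  # squared distance to other-class sample
    return max(d_pos - d_neg + margin, 0.0)

def equal_error_rate(genuine, impostor):
    """EER: the operating point where the false-accept rate (FAR)
    equals the false-reject rate (FRR). `genuine` and `impostor`
    are similarity scores, higher meaning 'more likely a match'."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    best_gap, best_eer = 1.0, 0.0
    for t in thresholds:
        far = np.mean(impostor >= t)  # impostors wrongly accepted
        frr = np.mean(genuine < t)    # genuine users wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer
```

With a well-separated embedding, the triplet loss goes to zero (the negative is already farther than the positive plus the margin), and perfectly separable genuine/impostor score distributions yield an EER of zero; the 9.0% reported in the abstract means FAR and FRR cross at 9.0%.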
© 2020 Springer Nature Switzerland AG
Cite this paper
Ruengprateepsang, K., Wangsiripitak, S., Pasupa, K. (2020). Hybrid Training of Speaker and Sentence Models for One-Shot Lip Password. In: Yang, H., Pasupa, K., Leung, A.CS., Kwok, J.T., Chan, J.H., King, I. (eds) Neural Information Processing. ICONIP 2020. Lecture Notes in Computer Science(), vol 12532. Springer, Cham. https://doi.org/10.1007/978-3-030-63830-6_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-63829-0
Online ISBN: 978-3-030-63830-6