Abstract
Learning disentangled representations of data is a key problem in deep learning. In particular, disentangling 2D facial landmarks into separate factors (e.g., identity and expression) is widely used in applications such as face reconstruction, face reenactment, and talking-head generation. However, due to the sparsity of landmarks and the lack of accurate labels for the underlying factors, learning a disentangled representation of landmarks is difficult. To address these problems, we propose a simple and effective model, FLD-VAE, that disentangles arbitrary facial landmarks into identity and expression latent representations within a Variational Autoencoder framework. In addition, we propose three invariance loss functions, at both the latent and data levels, that constrain the representations to remain invariant during training. Moreover, we introduce an identity preservation loss to further strengthen the representational ability of the identity factor. To the best of our knowledge, this is the first work to disentangle identity and expression factors simultaneously, end-to-end, from a single set of facial landmarks.
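To make the factorized-VAE idea concrete, the following is a minimal illustrative sketch, not the authors' FLD-VAE: a VAE that encodes flattened 2D landmark coordinates into separate identity and expression latent codes and decodes their concatenation back to landmarks. The layer widths, latent dimensions, and KL weight are assumptions for illustration; the paper's invariance and identity preservation losses are not reproduced here.

```python
# Illustrative sketch of a landmark VAE with two latent factors (PyTorch).
# Architecture sizes and the KL weight are assumed, not taken from the paper.
import torch
import torch.nn as nn

class LandmarkVAE(nn.Module):
    def __init__(self, n_points=68, id_dim=16, exp_dim=16, hidden=256):
        super().__init__()
        in_dim = n_points * 2  # flattened (x, y) coordinates
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Separate heads predict the mean and log-variance of each factor.
        self.id_head = nn.Linear(hidden, 2 * id_dim)
        self.exp_head = nn.Linear(hidden, 2 * exp_dim)
        self.decoder = nn.Sequential(
            nn.Linear(id_dim + exp_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, in_dim),
        )

    @staticmethod
    def reparameterize(mu, logvar):
        # Standard VAE reparameterization trick.
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, landmarks):
        h = self.encoder(landmarks.flatten(1))
        id_mu, id_logvar = self.id_head(h).chunk(2, dim=-1)
        exp_mu, exp_logvar = self.exp_head(h).chunk(2, dim=-1)
        z_id = self.reparameterize(id_mu, id_logvar)
        z_exp = self.reparameterize(exp_mu, exp_logvar)
        recon = self.decoder(torch.cat([z_id, z_exp], dim=-1))
        return recon, (id_mu, id_logvar), (exp_mu, exp_logvar)

def vae_loss(recon, target, id_stats, exp_stats, kl_weight=1e-3):
    # Reconstruction term plus KL terms for both latent factors.
    rec = nn.functional.mse_loss(recon, target.flatten(1))
    kl = 0.0
    for mu, logvar in (id_stats, exp_stats):
        kl = kl + (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)).mean()
    return rec + kl_weight * kl
```

Under this kind of split, swapping z_id between two samples while keeping z_exp fixed is the natural way to probe whether identity and expression have actually been separated.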
Funding
Supported by the National Natural Science Foundation of China (61210007).
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Liang, S., Zhou, Zz., Guo, Yd. et al. Facial landmark disentangled network with variational autoencoder. Appl. Math. J. Chin. Univ. 37, 290–305 (2022). https://doi.org/10.1007/s11766-022-4589-0
DOI: https://doi.org/10.1007/s11766-022-4589-0