Facial landmark disentangled network with variational autoencoder

Learning disentangled representations of data is a key problem in deep learning. Specifically, disentangling 2D facial landmarks into different factors (e.g., identity and expression) is widely used in applications such as face reconstruction, face reenactment and talking head generation. However, due to the sparsity of landmarks and the lack of accurate labels for the factors, it is hard to learn a disentangled representation of landmarks. To address this problem, we propose a simple and effective model named FLD-VAE, based on a variational autoencoder framework, to disentangle arbitrary facial landmarks into identity and expression latent representations. Besides, we propose three invariant loss functions at both the latent and data levels to constrain the invariance of the representations during the training stage. Moreover, we implement an identity preservation loss to further enhance the representation ability of the identity factor. To the best of our knowledge, this is the first work to disentangle identity and expression factors simultaneously, end-to-end, from a single facial landmark.


§1 Introduction
With the recent advances in deep learning, face representation has gained growing interest in extensive creative applications such as face reenactment [3,38], avatar animation [26,28] and talking face generation [34,40]. The core insight behind face representation is to disentangle a face into different factors, such as identity, expression and pose, where the identity factor tells who the face is, the pose factor states the position and rotation, and the expression factor shows what emotion it wears. Please note that the identity and pose factors are sometimes combined into a single factor [35,38].
Current face representation models can be classified into three different types: 2D-based, 3D-based and feature-based. The 2D-based model, generally referred to as landmarks [13,15], is a sparse representation of facial shape and expression. The 3D-based model is often referred to as the 3D Morphable Face Model (3DMM) [1,12,23,25], which is a parameterized representation of a 3D face or head mesh and has the benefit of manipulating the 3D mesh with different parameters (e.g., identity, expression, illumination and textures). The feature-based model encodes face images into a latent space, and is usually used in face recognition applications [9,31].

Figure 1. A toy experiment with our proposed FLD-VAE network. Four images were selected from the RaFD dataset, one in each corner, with different identities and expressions (top-left: a woman with a neutral expression, top-right: a man with a smile, bottom-left: a girl with a smile, bottom-right: a boy with fear). The landmarks in the dotted boxes are extracted from the images; the other landmarks are generated by our proposed method with bilinearly interpolated identity and expression latent codes.
Landmarks are simple and effective among the three face representations. However, unlike 3DMM, which naturally models disentanglement factors as parameters, how to disentangle landmarks into different semantic factors (e.g., identity and expression) is still a challenging problem for the following reasons: 1) landmarks are sparse and discrete in 2D space, which makes it hard to model the information of the identity and expression factors; 2) there is a lack of sufficient explicitly supervised data with accurate labeling of both identity and expression factors. Recently, Xiang et al. proposed a two-stage landmark disentanglement network (LD-Net) [35], in which the first stage encodes a stable expression code from the landmark with identity labels, and the second stage generalizes to predict the identity code from landmarks instead of using identity labels. Zhang et al. proposed a landmark convert module (ULC) to convert the expression from the source identity to the target identity [38]. These methods focus on extracting either the identity or the expression factor from source or target images, without building a unified representation that includes both parts from a single facial landmark.
To address this problem, we propose an end-to-end generation network based on the variational autoencoder (VAE) framework. Our method first encodes the landmark into two latent representations, which represent the identity factor (here it refers to the hybrid of the identity and pose factors) and the expression factor separately, and then combines the two factors to decode the landmark itself as the supervision. In this way, we can easily generate a desired landmark by combining identity and expression factors from different source landmarks, and transferring the expression of a source person to a target person becomes convenient. Although it is hard to accurately annotate continuous labels of identity and expression, it is fairly easy to tell whether two images show the same person, or whether two different people have the same expression. By exploiting the invariance property of these factors, we propose three invariant loss functions at both the latent and data levels. At the latent level, different landmarks with the same identity (or expression) should be encoded into the same identity (or expression) latent code. At the data level, a reconstructed facial landmark should be similar to other facial landmarks with similar identity and expression. Besides, we propose an identity preserving loss to push landmarks with the same identity closer together in the identity latent space than landmarks with different identities. With these four designed loss functions, our model achieves effective reconstruction and disentanglement performance, which is qualitatively and quantitatively verified on three datasets. To the best of our knowledge, our method is the first model to disentangle 2D facial landmarks into identity and expression representations.
We summarize our contributions as follows:
• We present the first facial landmark disentanglement network in 2D space that encodes landmarks into latent representations of identity and expression.
• We design four loss terms, based on the invariance of the identity and expression factors and on the identity preserving property, to endow our proposed model with the ability of facial landmark disentanglement.
• Extensive qualitative and quantitative experimental results demonstrate the effectiveness of our method on the LRW, GRID and RaFD datasets.

§2 Related Work

Disentangled Representation Learning
Disentangled representation learning aims to discover a set of generative factors, each element of which encodes a unique and independent factor of variation in the data [22]. Motivated by its attractive compositional properties and theoretically efficient generalization ability, a number of models that seek to learn disentangled representations in a weakly-supervised or self-supervised manner have been proposed, such as DC-IGN [20], InfoGAN [5], and β-VAE [14]. In recent years, extensive disentangled representation methods have been applied in face reenactment [3,38] and talking face [34,40] applications. Zhou et al. [39] proposed a disentangled audio-visual system (DAVS) for talking face generation by decoupling face sequences into subject-related and speech-related information. Lee et al. [22] proposed an end-to-end ID-GAN network to synthesize high-fidelity face images by combining a VAE-based disentanglement model and a GAN model. Jiang et al. [16] proposed a disentangled representation for 3D face meshes which uses two parallel VAE networks to extract identity and expression separately.
Face Disentangled Representation
To meet the needs of different facial tasks, many face representation models have been developed in recent research, which fall into the three categories mentioned above. As the most representative 2D-based model, landmarks are a conventional representation of facial shape and expression, widely used in face editing and image synthesis. Typical forms of landmarks include five-point, 68-point, 98-point and denser keypoint shapes [2,13,15,18]. The most classic 3D-based model is the 3D morphable model (3DMM), which was first presented at SIGGRAPH'99 [1]. It adopts a collection of parameters (e.g., identity, expression, pose, texture and illumination) to fully represent a 3D face. There have been a large number of efforts improving the 3DMM modeling mechanism, such as BFM [12,25], FLAME [23] and CoMA [27]. Besides, there are also extensive face reconstruction methods that estimate 3DMM coefficients from a single face image, such as [10,11,29]. Feature-based representations map a face image into a latent code, such as Eigenface [31] and ArcFace [9].

Face Identity Preserving
Identity preserving works aim to preserve the identity of a person while other attributes are controlled or transferred from a different person; this has been used in voice conversion [32], personalized motion retargeting [36,37] and expression transfer [33,38]. Sinha et al. [30] proposed a four-stage method at the landmark level for realistic talking face generation, which retargets person-independent landmarks to person-specific landmarks by preserving the identity-related facial structure. Zhang et al. proposed FReeNet to convert the expression from the source landmark to the target with a unified landmark converter module, and then use a GAN to generate the image [38]. Nitzan et al. did similar work directly at the image level, first encoding the identity of the source image into a latent space $I_{id}$ and all other attributes of the target image into a latent space $I_{attr}$, and then concatenating them to synthesize a realistic image with a StyleGAN network [17]. Recently, Xiang et al. proposed a two-stage LD-Net to disentangle the identity from the head pose and expression [35]. These methods extract only one factor (e.g., identity or expression) from either the source or the target image, without the ability to disentangle multiple factors simultaneously. On the contrary, our method is an end-to-end framework that can learn both the identity and expression information as latent codes from a single facial landmark.

§3 Method

In this section, we aim to design a disentangled representation model for 2D facial landmarks. We are given a dataset $D = \{x_i\}_{i=1}^{N}$ that consists of $N$ facial landmarks $x_i \in \mathbb{R}^{68 \times 2}$ extracted from $N$ images, where 68 is the number of landmark keypoints in our experiments. To achieve disentanglement, we assume that each landmark can be represented by a latent variable $z$, which is composed of two orthogonal parts: identity $z_{idn}$ and expression $z_{exp}$.
The priors of the identity and expression latent variables $z_{idn}$ and $z_{exp}$ are both defined as diagonal Gaussian distributions:

$$p(z_{idn}) = \mathcal{N}\big(\mu_{idn}, \operatorname{diag}(\sigma_{idn}^2)\big) \quad (1)$$
$$p(z_{exp}) = \mathcal{N}\big(\mu_{exp}, \operatorname{diag}(\sigma_{exp}^2)\big) \quad (2)$$

In this case, the prior of $z$ can be expressed as

$$p(z) = p(z_{idn})\, p(z_{exp}) \quad (3)$$

and the marginal likelihood of a facial landmark $x$ is conditioned on $z_{idn}$ and $z_{exp}$, which can be written as

$$p_\theta(x) = \int p_\theta(x \mid z_{idn}, z_{exp})\, p(z_{idn})\, p(z_{exp})\, \mathrm{d}z_{idn}\, \mathrm{d}z_{exp} \quad (4)$$

where $\mu_*$ and $\sigma_*$ are the mean and standard deviation of $z_*$ respectively, $\theta$ represents the parameters of the generative model, and $p_\theta(x \mid z_{idn}, z_{exp})$ is the conditional distribution of $x$, modeled as a multivariate Gaussian with a diagonal covariance matrix.
Since the marginal likelihood $p_\theta(x)$ and the true posterior density $p_\theta(z_{idn}, z_{exp} \mid x)$ are intractable, we use the VAE model to approximate the true posterior with an inference model $q_\phi$ given by

$$q_\phi(z_{idn}, z_{exp} \mid x) = q_\phi(z_{idn} \mid x)\, q_\phi(z_{exp} \mid x) \quad (5)$$

where $\phi$ represents the parameters of the encoder network (the left part of our FLD-VAE model in Figure 2), and $q_\phi(z_{idn} \mid x)$ and $q_\phi(z_{exp} \mid x)$ are the identity and expression latent distributions with their means ($\mu$) and standard deviations ($\sigma$) inferred by the encoder network.

Facial Landmark Disentangled Network
The pipeline of our proposed Facial Landmark Disentangled network with Variational Auto-Encoder (FLD-VAE) is shown in Figure 2. Our network adopts the classical VAE framework with an encoder and a decoder; the difference is that the output of our encoder is divided into two branches to represent the identity and expression latent variables. The input of the network is a 2D facial landmark $x \in \mathbb{R}^{68 \times 2}$.
In the encoder module, a four-layer multi-layer perceptron (MLP) is used to extract features from the input landmark, followed by an identity embedding network $E_{idn}$ and an expression embedding network $E_{exp}$. In each embedding network, we apply an MLP to produce a diagonal Gaussian distribution with mean and log-variance. ReLU is used as the activation function in the encoder [8]. Overall, the encoder can be formulated as two functions: $E_{idn}$ maps the input landmark $x$ to $\mu_{idn}$ and $\log \sigma_{idn}^2$, the Gaussian distribution parameters of the identity latent space, and $E_{exp}$ maps $x$ to $\mu_{exp}$ and $\log \sigma_{exp}^2$:

$$\mu_{idn}, \log \sigma_{idn}^2 = E_{idn}(x) \quad (6)$$
$$\mu_{exp}, \log \sigma_{exp}^2 = E_{exp}(x) \quad (7)$$

The decoder module, denoted as $D_{ldmk}$, is simply a 4-layer MLP that reconstructs the facial landmark $x$ from the identity and expression latent variables $z_{idn}$ and $z_{exp}$ (Equation 8). During decoding, we first apply the reparameterization trick [19] to sample $z_{idn}$ and $z_{exp}$ from their latent distributions (Equations 9 and 10), and then concatenate them together as the input of our decoder network.
$$\hat{x} = D_{ldmk}(z_{idn}, z_{exp}) \quad (8)$$
$$z_{idn} = \operatorname{sample}\big(\mathcal{N}(\mu_{idn}, \operatorname{diag}(\sigma_{idn}^2))\big) = \mu_{idn} + \sigma_{idn} \odot \epsilon \quad (9)$$
$$z_{exp} = \operatorname{sample}\big(\mathcal{N}(\mu_{exp}, \operatorname{diag}(\sigma_{exp}^2))\big) = \mu_{exp} + \sigma_{exp} \odot \epsilon \quad (10)$$

where $\hat{x}$ is the reconstructed landmark, $\operatorname{sample}(\cdot)$ draws a sample from a distribution via the reparameterization trick, and $\epsilon \sim \mathcal{N}(0, I)$.
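To make the encoder-decoder structure concrete, the following is a minimal PyTorch sketch of the network described by Equations 6-10. It is a sketch under stated assumptions, not the reference implementation: the hidden width (256) and the single-linear embedding heads are assumptions, since the paper only specifies 4-layer MLPs, ReLU activations, a 68×2 input and (per Sec. 4.2) 64-dimensional latent codes.

```python
import torch
import torch.nn as nn

class FLDVAE(nn.Module):
    def __init__(self, n_points=68, hidden=256, z_dim=64):
        super().__init__()
        in_dim = n_points * 2
        # Shared 4-layer MLP feature extractor of the encoder module.
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Identity / expression embedding heads, each predicting (mu, log sigma^2).
        self.e_idn = nn.Linear(hidden, 2 * z_dim)  # Eq. (6)
        self.e_exp = nn.Linear(hidden, 2 * z_dim)  # Eq. (7)
        # 4-layer MLP decoder D_ldmk over the concatenated latent codes, Eq. (8).
        self.decoder = nn.Sequential(
            nn.Linear(2 * z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, in_dim),
        )
        self.n_points = n_points

    @staticmethod
    def reparameterize(mu, logvar):
        # Reparameterization trick, Eqs. (9)-(10): z = mu + sigma * eps.
        return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

    def forward(self, x):                          # x: (B, 68, 2)
        h = self.backbone(x.flatten(1))
        mu_i, logvar_i = self.e_idn(h).chunk(2, dim=-1)
        mu_e, logvar_e = self.e_exp(h).chunk(2, dim=-1)
        z_idn = self.reparameterize(mu_i, logvar_i)
        z_exp = self.reparameterize(mu_e, logvar_e)
        x_hat = self.decoder(torch.cat([z_idn, z_exp], dim=-1))
        return x_hat.view(-1, self.n_points, 2), (mu_i, logvar_i), (mu_e, logvar_e)
```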
To learn this disentangled VAE model, we optimize the variational lower bound $\mathcal{L}_{ELBO}$ on the marginal likelihood of the data, which is composed of a reconstruction term and two Kullback-Leibler (KL) divergence terms over the identity and expression latent spaces:

$$\mathcal{L}_{ELBO} = \mathbb{E}_{q_\phi(z_{idn}, z_{exp} \mid x)}\big[\log p_\theta(x \mid z_{idn}, z_{exp})\big] - \beta_{idn} D_{KL}\big(q_\phi(z_{idn} \mid x) \,\|\, p(z_{idn})\big) - \beta_{exp} D_{KL}\big(q_\phi(z_{exp} \mid x) \,\|\, p(z_{exp})\big) \quad (11)$$

where the reconstruction term (the first term of Equation 11) forces the input and output to be the same, and the two KL terms force the identity and expression posteriors $q(z_{idn} \mid x)$ and $q(z_{exp} \mid x)$ to stay close to the priors $p(z_{idn})$ and $p(z_{exp})$ respectively. $\beta_{idn}$ and $\beta_{exp}$ are hyperparameters that balance latent capacity and independence constraints against reconstruction accuracy [14].
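As a sketch, the negative of Equation 11 (so that it can be minimized as a loss) might be computed as below. We assume standard-normal priors and a Gaussian likelihood, so the reconstruction term reduces to a squared error; the exact likelihood scaling is not specified in the paper.

```python
import torch
import torch.nn.functional as F

def kl_to_standard_normal(mu, logvar):
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior.
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0, dim=-1).mean()

def elbo_loss(x, x_hat, idn_stats, exp_stats, beta_idn=0.001, beta_exp=0.001):
    # Negative ELBO of Eq. (11): reconstruction error plus beta-weighted KL
    # terms for the identity and expression posteriors.
    recon = F.mse_loss(x_hat, x)
    return (recon
            + beta_idn * kl_to_standard_normal(*idn_stats)
            + beta_exp * kl_to_standard_normal(*exp_stats))
```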

Objective Functions
Besides $\mathcal{L}_{ELBO}$, we propose four objective functions to train our model. In order to disentangle facial landmarks into identity and expression with our proposed FLD-VAE model, two problems should be solved: 1) how to disentangle the two factors from each other and keep each of them consistent; 2) how to learn the semantic information of each factor, that is, to let each latent representation characterize its corresponding factor's information. To address these problems, we propose two invariant losses that enforce disentanglement at the latent level ($\mathcal{L}_{LIL}$) and the data level ($\mathcal{L}_{DIL}$), one cycle invariant loss ($\mathcal{L}_{CIL}$) that keeps each factor invariant when the other is exchanged, and an identity preserve loss ($\mathcal{L}_{IPL}$) that preserves identity with a triplet loss function.

Latent-level Invariant Loss. The key aim of our model is to disentangle identity and expression information from the facial landmark representation. We observe that a landmark should remain invariant when the latent code of one factor is replaced by that of another landmark sharing the same factor. That is to say, the identity latent code should be invariant to expression changes once the identity information is decoupled, and likewise, the expression latent code should be invariant to identity changes once the expression information is decoupled. We can therefore introduce a latent-level invariant loss (shown in Figure 3) to constrain the identity and expression latent variables. Let $x_{A,n}$ and $x_{A,m}$ be two landmarks with the same identity $A$, and $x_{A,n}$ and $x_{B,n}$ be two landmarks with the same expression $n$:

$$\mathcal{L}_{LIL} = f_D\big(z_{idn}^{A,n}, z_{idn}^{A,m}\big) + f_D\big(z_{exp}^{A,n}, z_{exp}^{B,n}\big) \quad (12)$$

Data-level Invariant Loss. With the same observation as $\mathcal{L}_{LIL}$, we introduce a data-level invariant loss ($\mathcal{L}_{DIL}$) to constrain the reconstructed landmarks. Specifically, since the identity latent codes of $x_{A,n}$ and $x_{A,m}$ are similar, we expect the two landmarks reconstructed from them with the same expression latent code to be similar. Likewise, since the expression latent codes of $x_{A,n}$ and $x_{B,n}$ are similar, we expect the two landmarks reconstructed from them with the same identity latent code to be similar. $\mathcal{L}_{DIL}$ can be formulated as

$$\mathcal{L}_{DIL} = f_D\big(D_{ldmk}(z_{idn}^{A,n}, z_{exp}), D_{ldmk}(z_{idn}^{A,m}, z_{exp})\big) + f_D\big(D_{ldmk}(z_{idn}, z_{exp}^{A,n}), D_{ldmk}(z_{idn}, z_{exp}^{B,n})\big) \quad (13)$$

Identity Preserve Loss. In experiments we found that the identity latent code is hard to converge. The reason is that we combine the pose and identity factors into one latent representation, so the identity latent codes from landmarks with the same identity but different poses will be slightly different. In this situation, we propose an identity preserve loss $\mathcal{L}_{IPL}$ to address this problem, which requires that the distance between landmarks with the same identity but different pose or expression be smaller than the distance between landmarks with different identities but the same pose or expression. Specifically, we construct a triple pair $(x_{A,n}, x_{A,m}, x_{B,n})$ and implement the identity preserve loss as

$$\mathcal{L}_{IPL} = \max\big(f_D(z_{idn}^{A,n}, z_{idn}^{A,m}) - f_D(z_{idn}^{A,n}, z_{idn}^{B,n}) + m,\ 0\big) \quad (14)$$

where $f_D(\cdot,\cdot)$ is a distance function and $m$ is the margin of this triplet loss. In our experiments, we use the L1 function as our distance function and set $m = 0.1$.
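Since the exact equations were elided in the source, the snippet below gives one plausible reading of $\mathcal{L}_{LIL}$, $\mathcal{L}_{DIL}$ and $\mathcal{L}_{IPL}$ under the stated assumptions (L1 distance $f_D$, margin $m = 0.1$, triple pair $(x_{A,n}, x_{A,m}, x_{B,n})$). It is a hedged sketch, not the authors' reference implementation; the `decode` callable is a hypothetical wrapper around $D_{ldmk}$.

```python
import torch
import torch.nn.functional as F

def f_d(a, b):
    # L1 distance function f_D used throughout Sec. 3.
    return F.l1_loss(a, b)

def latent_invariant_loss(z_idn_An, z_idn_Am, z_exp_An, z_exp_Bn):
    # Eq. (12): same identity -> same identity code,
    # same expression -> same expression code.
    return f_d(z_idn_An, z_idn_Am) + f_d(z_exp_An, z_exp_Bn)

def data_invariant_loss(decode, z_idn_An, z_idn_Am, z_exp_An, z_exp_Bn):
    # Eq. (13): decoding with equivalent codes should give near-identical
    # landmarks. `decode` maps (z_idn, z_exp) to a reconstructed landmark.
    rec = decode(z_idn_An, z_exp_An)
    return (f_d(rec, decode(z_idn_Am, z_exp_An))
            + f_d(rec, decode(z_idn_An, z_exp_Bn)))

def identity_preserve_loss(z_idn_An, z_idn_Am, z_idn_Bn, margin=0.1):
    # Eq. (14): triplet loss with anchor x_{A,n}, positive x_{A,m}
    # (same identity), negative x_{B,n} (different identity).
    pos = f_d(z_idn_An, z_idn_Am)
    neg = f_d(z_idn_An, z_idn_Bn)
    return torch.clamp(pos - neg + margin, min=0.0)
```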

Total Loss
The proposed FLD-VAE network can be trained end-to-end with a weighted sum of the loss terms defined above:

$$\mathcal{L} = \mathcal{L}_{ELBO} + \lambda_1 \mathcal{L}_{LIL} + \lambda_2 \mathcal{L}_{CIL} + \lambda_3 \mathcal{L}_{DIL} + \lambda_4 \mathcal{L}_{IPL} \quad (15)$$

where $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ represent the weights of the loss terms $\mathcal{L}_{LIL}$, $\mathcal{L}_{CIL}$, $\mathcal{L}_{DIL}$ and $\mathcal{L}_{IPL}$ respectively.

§4 Experiment

In this section, we evaluate the performance of our proposed FLD-VAE model and compare it with state-of-the-art methods on three datasets. Moreover, ablation studies were conducted to illustrate the effectiveness of our four proposed objective functions. We first introduce the three experimental datasets in Sec. 4.1 and the model implementation details in Sec. 4.2. We then introduce four metrics for measuring reconstruction and disentanglement performance in Sec. 4.3, and show the qualitative and quantitative experimental results in Sec. 4.4 and Sec. 4.5. Finally, we present an ablation study in Sec. 4.6.

Datasets
In our experiments, three datasets are used for training and testing our model.
GRID. The GRID dataset [7] consists of 34,000 high-quality audio-visual videos, which record 1,000 sentences spoken by 34 talkers (18 male and 16 female). Each video has 75 frames. In our experiments, a total of 30,000 videos from 30 talkers are used for training, and the remaining videos are used for testing.
LRW. The LRW dataset [6] consists of 272 different words spoken by hundreds of different news speakers; each word has about 1,000, 50 and 50 short videos for the training, testing and validation stages respectively. In our experiments, we only use 50,000 videos for training and 5,000 videos for testing.
RaFD. The Radboud Faces Database (RaFD) [21] contains 8,040 images collected from 67 participants. Each participant makes eight emotional expressions (anger, disgust, fear, happiness, sadness, surprise, contempt, and neutral) in three different gaze directions, and all pictures are taken from five different angles simultaneously; only three angles (45, 90 and 135 degrees) are used in our experiments.

Implementation Details
In the data preprocessing procedure, we first extract the 2D facial landmarks of each video with an off-the-shelf method [18], then crop out the face according to its landmark, and finally normalize the coordinates of each landmark to the range 0 ∼ 1 with respect to the new crop region. At the training stage, we randomly build triple pair data (3 × 16 frames) as input for each mini-batch. Specifically, the first and second groups of 16 frames are continuous video clips randomly selected from the LRW and GRID datasets respectively. To make sure the identities of the parts of the triple pair are different, the last 16 frames are selected from the RaFD dataset, in which four different identities are randomly picked with two different expressions and two different head poses. With this input data structure, we can easily optimize our proposed objective function terms.
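The normalization step might look like the minimal sketch below. The square crop and the padding ratio are assumptions; the paper does not specify how the crop box is derived from the landmark.

```python
import numpy as np

def normalize_landmark(ldmk: np.ndarray, pad: float = 0.1) -> np.ndarray:
    """Map a (68, 2) landmark in pixel coordinates into [0, 1] relative to a
    square crop around the face. `pad` controls the margin around the points."""
    lo, hi = ldmk.min(axis=0), ldmk.max(axis=0)
    size = float((hi - lo).max()) * (1.0 + 2.0 * pad)  # square crop side length
    origin = (lo + hi) / 2.0 - size / 2.0              # top-left corner of crop
    return (ldmk - origin) / size
```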
Our algorithm is implemented with the PyTorch library [24]. All the training and testing experiments were conducted on a PC with an NVIDIA 2080Ti GPU and CUDA 12.0. In the training stage, the weights of the KL loss terms $\beta_{idn}$ and $\beta_{exp}$ are set to 0.001, the dimensions of $z_{exp}$ and $z_{idn}$ are both set to 64, and the weights of the loss terms $\mathcal{L}_{LIL}$, $\mathcal{L}_{CIL}$, $\mathcal{L}_{DIL}$ and $\mathcal{L}_{IPL}$ are set to 0.1, 1, 1 and 0.1 respectively. We train our network for 10,000 epochs per experiment with a learning rate of 1e-4 using the Adam optimization method.
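Putting the pieces together, one training step under these hyperparameters might look as follows. `FLDVAE` and `elbo_loss` refer to the hypothetical sketches in Section 3, and the triple-pair term computation is elided for brevity.

```python
import torch

model = FLDVAE(n_points=68, z_dim=64)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
lam_lil, lam_cil, lam_dil, lam_ipl = 0.1, 1.0, 1.0, 0.1  # Sec. 4.2 weights

def train_step(x):  # x: (B, 68, 2) batch of normalized landmarks
    x_hat, idn_stats, exp_stats = model(x)
    loss = elbo_loss(x, x_hat, idn_stats, exp_stats,
                     beta_idn=0.001, beta_exp=0.001)
    # ... add the lam_*-weighted L_LIL, L_CIL, L_DIL and L_IPL terms of
    # Eq. (15) here, computed over the triple-pair structure of the batch.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```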

Evaluation Metric
The main aim of our proposed method is to disentangle a given facial landmark into identity and expression latent representations as accurately as possible, while achieving high reconstruction accuracy at the same time. Therefore, our evaluation metrics cover two aspects: reconstruction and disentanglement measurement. We apply the Mean Absolute Error ($MAE$) and Landmark Distance ($LMD$) to measure reconstruction performance, and propose two disentanglement metrics for identity ($M_{idn}$) and expression ($M_{exp}$) respectively.
Mean Absolute Error. $MAE$ is used to measure the landmark reconstruction accuracy with the L1-norm function.
Figure 5. Disentanglement result. The first column (row) contains source (driven) images providing identity (expression) information, where all the images are randomly selected from the LRW dataset. The second column (row) contains landmarks extracted from the images. The middle landmarks are generated by our proposed model from the identity latent code (of the source landmark) and the expression latent code (of the driven landmark). Please zoom in for more detail.
Landmark Distance. $LMD$ [4] computes the point-wise Euclidean distance between two facial landmarks in Cartesian coordinate space:

$$LMD = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{P}\sum_{i=1}^{P}\sqrt{\big(x_{n,i}^{x} - \hat{x}_{n,i}^{x}\big)^2 + \big(x_{n,i}^{y} - \hat{x}_{n,i}^{y}\big)^2}$$

where $N$ denotes the mini-batch size of each training batch, $P$ denotes the number of landmark points, and $(x_{n,i}^{x}, x_{n,i}^{y})$ denotes the position of landmark point $x_{n,i}$ on the x-axis and y-axis. In our experimental setting, we use 68 landmark points.
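As a sketch, the $LMD$ metric reconstructed above can be computed in a few lines:

```python
import numpy as np

def lmd(x: np.ndarray, x_hat: np.ndarray) -> float:
    """x, x_hat: (N, P, 2) landmark batches; returns the mean point-wise
    Euclidean distance over all N landmarks and P points."""
    return float(np.linalg.norm(x - x_hat, axis=-1).mean())
```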
Disentanglement Measurement. To evaluate disentanglement performance, a simple but fundamental idea is to measure to what extent each factor remains invariant when the other factors change. With this principle, we design two disentanglement metrics for identity and expression, denoted as $M_{idn}$ and $M_{exp}$ respectively.
The metric $M_{idn}$ is proposed to measure how well the model keeps a facial landmark's identity information invariant when its expression latent code is changed. In our experiments, we select seven key points to represent each landmark's identity shape, located at the left and right chin centers, the two eye centers, the tip of the nose and the two mouth corners. $M_{idn}$ can be formulated as the landmark distance between the ground-truth seven-point landmark $x_n$ and the reconstructed seven-point landmark $\hat{x}_n$ whose expression code was changed.
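A sketch of $M_{idn}$ as a landmark distance restricted to the seven identity points follows. The concrete indices are hypothetical (the paper does not list them); eye centers, for instance, would in practice be averaged from the eye-contour points of the 68-point layout.

```python
import numpy as np

# Hypothetical indices for the seven identity keypoints (assumptions):
# chin sides, nose tip, outer eye corners as stand-ins for eye centers,
# and mouth corners, under the iBUG 68-point convention.
IDN_POINTS = [4, 12, 30, 36, 45, 48, 54]

def m_idn(x: np.ndarray, x_hat: np.ndarray) -> float:
    """x: ground-truth landmarks (N, 68, 2); x_hat: reconstructions whose
    expression codes were changed. Mean distance over the seven points."""
    diff = x[:, IDN_POINTS] - x_hat[:, IDN_POINTS]
    return float(np.linalg.norm(diff, axis=-1).mean())
```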
Similarly, the metric $M_{exp}$ is proposed to measure how well the model keeps a facial landmark's expression information invariant when its identity latent code is changed.
In our experiments, we simply use the mouth open ratio $f_{OMR}$ to represent each landmark's expression information. The expression measurement can be formulated as

$$M_{exp} = \frac{1}{N}\sum_{n=1}^{N}\big|\, f_{OMR}(x_n) - f_{OMR}(\hat{x}_n)\,\big|$$

where $f_{OMR}(x)$ is the mouth open ratio of landmark $x$, calculated as the ratio of the inner mouth height to the mouth width, and $\hat{x}_n$ is the reconstructed landmark whose identity code was changed.

Figure 6. Interpolation of the identity latent code while preserving expression. In each line, the two end landmarks were extracted from images with different identities and the same expression, and the middle landmarks were generated by interpolating the identity latent code while using the mean expression code.

Figure 7. Interpolation of the expression latent code while preserving identity. In each line, the two end landmarks were extracted from images with different expressions and the same identity, and the middle landmarks were generated by interpolating the expression latent code while using the mean identity code.
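A minimal sketch of $f_{OMR}$ and the resulting $M_{exp}$ follows; the keypoint indices are assumptions based on the common iBUG 68-point convention, which the paper does not explicitly confirm.

```python
import numpy as np

def mouth_open_ratio(ldmk: np.ndarray) -> float:
    """f_OMR for a (68, 2) landmark: inner-mouth height over mouth width.
    Indices 48/54 (mouth corners) and 62/66 (inner lips) are assumptions."""
    width = np.linalg.norm(ldmk[54] - ldmk[48])
    height = np.linalg.norm(ldmk[66] - ldmk[62])
    return float(height / (width + 1e-8))

def m_exp(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Mean absolute f_OMR difference over batches of (N, 68, 2) landmarks."""
    return float(np.mean([abs(mouth_open_ratio(a) - mouth_open_ratio(b))
                          for a, b in zip(x, x_hat)]))
```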

Qualitative Result
In this section, we perform experiments to evaluate our model qualitatively, mainly through two aspects: disentanglement performance and model generation ability.
To demonstrate the disentanglement performance of our proposed FLD-VAE model, we first randomly choose eight driven landmarks including four identities (each with two kinds of expression) from the LRW testing dataset, and four source landmarks including two identities (each with two kinds of expression), and then decode new landmarks from the identity latent codes of the source landmarks and the expression latent codes of the driven landmarks. The disentanglement result is shown in Figure 5. From the result, we find that our model preserves the source landmarks' identity shape well and also transfers the driven landmarks' expression information well, which shows that our FLD-VAE has strong disentanglement ability.
To further illustrate the generation ability of our landmark VAE framework, we conduct experiments on the RaFD dataset showing generation results obtained by interpolating each disentangled factor's latent code. Specifically, to show the generation ability over the identity latent code, we first randomly pick two landmarks of different identities with the same expression, then interpolate several identity codes between the two landmarks, and finally generate new landmarks from them together with the mean expression latent code. The experimental result is shown in Figure 6, which demonstrates that our model has good generation ability over the identity factor. Similarly, we interpolate the expression latent code to generate new landmarks while preserving identity. This result is shown in Figure 7, which demonstrates that our model also has good generation ability over the expression factor. From these two experiments, we conclude that our model shows good generation ability inherited from the VAE framework.

Quantitative Result
In this section, we quantitatively evaluate the effectiveness of our proposed FLD-VAE method on the GRID, LRW and RaFD datasets, and compare it with a baseline method and two competing methods:
• Baseline has the same network architecture as our proposed FLD-VAE model, but is trained with the $\mathcal{L}_{ELBO}$ loss only, which shows the base performance without disentanglement ability.
• AE is an ordinary autoencoder network that disentangles facial landmarks into identity and expression latent codes, with an architecture similar to our model. The difference is that the latent representation of a variational autoencoder is a distribution, whereas the latent representation of an autoencoder is a plain latent vector. We train the AE method with all of our proposed objective functions.
• ULC was proposed by Zhang et al. [38]; it contains two landmark encoders and a landmark shift decoder to convert the expression information of a source landmark to a target landmark.
• Ours is our proposed FLD-VAE model, described in Section 3.
With the above-mentioned methods, we evaluate reconstruction and disentanglement performance with four metrics: $MAE$, $LMD$, $M_{idn}$ and $M_{exp}$. For variety and fairness of comparison, we conduct experiments on the LRW, GRID and RaFD datasets, as shown in Table 1. The experiments on the GRID testing dataset show that our FLD-VAE model achieves the best reconstruction performance and expression disentanglement ability, while having slightly lower performance on identity preservation than the ULC model. A similar result is observed on the RaFD testing dataset, where our model surpasses the other methods in $LMD$, $MAE$ and $M_{exp}$. On the LRW testing dataset, we achieve the best performance on all metrics. These comparison results further suggest that our method achieves a better trade-off between reconstruction performance and facial landmark disentanglement ability. Please note that the ULC network aims to transfer a source landmark's expression to a target landmark, so we do not evaluate its reconstruction performance and mark "-" in its $LMD$ and $MAE$ columns.

Ablation Study
In our FLD-VAE model, four objective functions are proposed to improve the ability of facial landmark disentanglement, including the latent- and data-level invariant losses ($\mathcal{L}_{LIL}$ and $\mathcal{L}_{DIL}$), the cycle invariant loss ($\mathcal{L}_{CIL}$), and the identity preserve loss ($\mathcal{L}_{IPL}$). To explore the significance of each proposed objective term, we conduct a quantitative ablation study on the LRW dataset; the results are shown in Table 2.
In the ablation study, we denote by Baseline the FLD-VAE model trained with the ELBO term $\mathcal{L}_{ELBO}$ only, and by +$\mathcal{L}_{CIL}$ the FLD-VAE model trained with the ELBO loss $\mathcal{L}_{ELBO}$ plus the cycle invariant loss $\mathcal{L}_{CIL}$, and likewise for the other rows in Table 2. The Full Model is trained with all the objective terms. The Baseline experiment shows the basic reconstruction and disentanglement performance. Compared with Baseline, the results of +$\mathcal{L}_{CIL}$, +$\mathcal{L}_{LIL}$ and +$\mathcal{L}_{DIL}$ show that these losses provide the ability to disentangle landmarks and help reconstruction, where the data-level invariant loss $\mathcal{L}_{DIL}$ pushes $z_{idn}$ and $z_{exp}$ to learn semantic information and $\mathcal{L}_{LIL}$ performs better on identity disentanglement. The +$\mathcal{L}_{IPL}$ experiment yields worse results than Baseline on $M_{idn}$ and $M_{exp}$, possibly because its effect is hard to realize when it is trained alone. To verify this conjecture, we train our model with both the $\mathcal{L}_{DIL}$ and $\mathcal{L}_{IPL}$ losses in the +$\mathcal{L}_{DIL}$+$\mathcal{L}_{IPL}$ experiment, whose results are better than those of +$\mathcal{L}_{DIL}$ and +$\mathcal{L}_{IPL}$ in improving identity and expression disentanglement performance, and which also shows that the $\mathcal{L}_{IPL}$ loss can force $z_{idn}$ to carry and preserve identity-related information. Moreover, we add the $\mathcal{L}_{CIL}$ loss in the +$\mathcal{L}_{DIL}$+$\mathcal{L}_{IPL}$+$\mathcal{L}_{CIL}$ experiment, which further improves identity disentanglement performance at a slight cost of lowering the expression disentanglement ability. Finally, we combine all the loss terms in the Full Model experiment. The results show that its reconstruction ability is reduced slightly while its disentanglement capability is stronger than in the other experiments. This further illustrates that balancing reconstruction and disentanglement performance is still hard in VAE-based frameworks.

§5 Conclusion

In this paper, we propose a facial landmark disentangled network based on the variational autoencoder framework, which learns disentangled identity and expression representations in 2D space. To endow our proposed model with disentangled representation ability, we propose four objective functions based on the invariance of the identity and expression factors. The reconstruction performance and disentanglement ability of our proposed model are qualitatively and quantitatively verified on three datasets. We also conduct ablation studies to evaluate the effectiveness of the proposed loss functions. In the future, we plan to extend our model to sequential 2D/3D landmark disentangled representations and involve more disentanglement factors.

Figure 3. Illustration of the latent-level ($\mathcal{L}_{LIL}$) and data-level ($\mathcal{L}_{DIL}$) invariant losses, where $x_{A,n}$ and $x_{A,m}$ have the same identity but different expressions, and $x_{A,n}$ and $x_{B,n}$ have the same expression but different identities. $\mathcal{L}_{LIL}$ pushes landmarks with the same factor to have the same latent code, and $\mathcal{L}_{DIL}$ pushes landmarks reconstructed from the same latent code to be the same.

Figure 4. Illustration of the cycle invariant loss ($\mathcal{L}_{CIL}$). This illustration only shows that the expression latent code is kept invariant at the re-encoding stage when the codes are exchanged before decoding.

Table 1. Comparison of quantitative performance on three datasets with four metrics: $LMD$, $MAE$, $M_{idn}$ and $M_{exp}$. For all these metrics, lower values are better.

Table 2. Ablation study on the LRW dataset.