Disentangled Representation Transformer Network for 3D Face Reconstruction and Robust Dense Alignment



Introduction
3D face reconstruction is an essential topic in computer vision and graphics. The human face is one of the most discriminative parts of the human body, and its features and contours carry rich attribute and semantic information. Single-view 3D face reconstruction aims to recover the complete facial geometry from a given single-view image, and it plays a significant role in many visual analysis applications, such as face recognition [1], face verification [2], facial expression analysis [3], and facial animation [4].
However, many face images are affected by the shooting angle and the surrounding environment: problems such as partial occlusion and facial blurring occur, distorting the reconstructed 3D faces. How to perform accurate 3D face reconstruction and facial landmark alignment under large poses, partial occlusion, and insufficient illumination therefore remains an open problem. Conventional methods are mainly based on optimization algorithms, e.g., iterative closest point [5], shape from shading [6,7], and photometric stereo [8]. Nevertheless, locally optimal solutions, poor model initialization, and high optimization complexity make these facial reconstruction techniques complex and costly.
With the rise of deep learning in recent years, CNN regression-based methods [9,10] have achieved remarkable success in 3D face reconstruction and dense alignment. It is very challenging to reconstruct 3D face shapes from 2D images without prior knowledge, mainly because 2D images lack explicit depth information. A common approach to single-view 3D face reconstruction is to use a set of 3D base shapes to capture the subspace, i.e., to construct a morphable model of face shape variation. Feng et al. [11] proposed a simple convolutional neural network for 3D face reconstruction and dense alignment that regresses a UV position map from a single 2D image and records the complete 3D face shape in UV space. The method does not rely on any prior face model and can reconstruct the entire face geometry from the semantic information of the face. Jackson et al. [12] pointed out that existing 3D face reconstruction methods are affected by the pose, expression, and illumination of the input face photos, and that fitting the face model is complex and inefficient; their method therefore bypasses the construction of a 3D morphable model and regresses the 3D facial geometry directly from a single 2D facial image. However, these methods cannot explicitly capture information about individual attributes of the face for single-view 3D reconstruction and face alignment.

Fig. 1 Conventional 3D face reconstruction and our DRTN framework. Inside the blue dashed box, the traditional method regresses face attribute parameters independently. In the red dashed box, the DRTN method uses the correlation between face attribute features to reinforce the network's learning of attribute parameters.

Some current approaches [14] decompose the face attributes and estimate their shape, expression, and pose parameters individually. Although this enhances the learning of a single face attribute, these methods do not consider the interaction between features. Motivated by this, we extract shape, expression, and pose parameters from 2D images and fuse them to capture the dependencies between the attributes, which is useful for 3D face reconstruction and alignment in environments with large poses, self-occlusion, and poor lighting. Our approach achieves complementary feature information by decomposing the face attribute information to enhance the correlation between individual attributes. To this end, we carefully design a disentangled representation transformer network (DRTN), which includes identity, expression, and pose branches. The identity branch aims to enhance the learning of expression and pose attributes by preserving the overall facial geometry and keeping the identity intact. Accordingly, the expression and pose branches maintain the consistency of the expression and pose attributes, respectively, which helps refine the reconstruction and alignment of facial details under large poses, mainly by coupling the other facial attribute parameters. The network parameters of the designed attribute-branch architecture are optimized end-to-end by backpropagation. Fig. 1 shows the pipeline of our proposed DRTN. Experimental results show that the method significantly outperforms traditional independent regression of attribute parameters in 3D face reconstruction and landmark alignment and exhibits very competitive performance on the test datasets. The contributions of our work are summarized as follows: (1) We develop a decomposed representation learning method for faces to explicitly model the correlation between face attributes. The method reduces the ambiguity of facial attribute learning in traditional CNN-based parameter regression and enhances the learning of face attribute information, enabling semantic editing in the identity, expression, and pose domains.
(2) We propose a novel disentangled representation transformer network framework for single-view 3D face reconstruction and alignment. Our framework regresses the representation through face attribute branches, with the identity, expression, and pose components regressed independently. Based on identity attribute consistency, the identity branch introduces pose and expression attribute information to enhance the integrity of the geometric facial profile. Accordingly, the expression and pose branches refine the expression and landmark alignment of the 3D face by coupling the other attribute parameters while maintaining the consistency of each attribute.
(3) To further improve performance, our transformer network is designed to address missing details of the facial geometric contour. The capacity of the deep model is effectively controlled, and complementary information is extracted from the multi-attribute face input. This establishes similarities between shallow and deep representations of faces and mines local attribute information of faces through global information interaction.

Related Works
This section briefly reviews the existing 3D facial alignment methods and reconstruction techniques.

Face Alignment
Face alignment in computer vision is a long-standing and widely discussed problem. Early 2D face alignment methods, which fit the face shape of a given input image by constructing an overall fitting template, aimed to locate a set of baseline 2D facial landmarks. Representative methods of this type include active shape models (ASM) [15], active appearance models (AAM) [16], and constrained local models (CLM) [17]. However, when the pose is large, the computation is slow and the description of the face shape inaccurate, because multiple classifier regressions are needed to locate the 2D facial landmarks.
Recently, work has used deep learning to study 3D face alignment in large poses. Among the methods for 3D face alignment are the direct regression of face parameters, bypassing the 3D deformation model, and fitting a dense 3DMM with CNN cascade regression. For example, Amin Jourabloo et al. [18] designed a cascaded coupled regression method to estimate the camera projection matrix and 3D face landmarks by integrating a 3D point distribution model. Adrian Bulat et al. [19] proposed a heat-map-regression-based method to estimate 3D face landmarks. Each landmark corresponds to a heat map in this method, and these heat maps are regressed with the input RGB images by learning the 3D face depth values through a residual network. A video-based 3D cascade regression method was developed by Jeni et al. [20], in which a dense 3D shape is generated in real time from an input 2D face image. The algorithm estimates the positions of a dense set of markers and their visibility and then achieves dense 3D face alignment by fitting a partial 3D model. Xin Ning et al. [21] proposed a real-time 3D face alignment method that uses an encoder-decoder network with efficient convolutional layers to enhance information transfer between different resolutions in the encoding and decoding stages, achieving advanced performance.
Several works have performed face alignment by fitting 3D morphable models (3DMM) to 2D facial images. [22] fits a 3DMM with a single CNN in an iterative manner, where the CNN augments the input channel with the represented shape features in each iteration. [23] uses multi-constraint training of CNNs to estimate 3DMM parameters and then provides very dense 3D alignment. Lei Jiang et al. [24] proposed a dual-attention mechanism and a practical end-to-end 3D face alignment framework, constructing a stable network model with depthwise separable convolution, densely connected convolution, and a light channel attention mechanism. This model-based 3D reconstruction method can easily accomplish the task of 3D face alignment. However, these methods only use deep networks to directly learn the relationship between 2D input images and 3D models and do not consider the integrity of the correspondence between face attributes. To address this, our model decomposes face attributes through a unified deep learning architecture and automatically extracts useful feature information directly from pixels using complementary information between face attributes. This approach infers face contours in invisible regions and improves face alignment in large poses.

3D Face Reconstruction
Initially, 3D face reconstruction was mainly used in medical applications for human head diagnosis. Facial reconstruction was primarily done by scanning the face with a 3D scanner [25,26] to obtain the face's shape, structure, and texture information and reconstruct a full 3D face. Although reconstructing 3D faces with scanners is more accurate, the whole process is complicated and time-consuming. Therefore, Volker Blanz et al. [27] proposed the 3D morphable model (3DMM). The 3D morphable model is built from a 3D face database, with face shape and texture statistics as constraints; the influence of pose and illumination on the reconstruction process is taken into account to make the generated 3D face model more accurate. Later, Paysan et al. [28] proposed the Basel Face Model (BFM) based on the original 3DMM, with improved alignment algorithms for higher shape and texture accuracy. However, most of these methods establish the correspondence of vertices between the input image and the 3D template and then solve a nonlinear optimization function to regress the 3DMM coefficients. They therefore rely heavily on the accuracy of landmark or other feature point detectors.
Due to the wide application of deep learning in various fields, many recent works utilize CNNs to directly predict face parameters for reconstruction. Elad et al. [29] introduced an end-to-end convolutional neural network framework that generates the geometry of a face in a coarse-to-fine manner. Subsequently, Dou et al. [30] proposed a deep neural network-based approach that improves facial expression reconstruction by integrating multi-task loss functions and fused convolutional neural networks into a DNN structure. This approach avoids the complex 3D rendering process, but the reconstruction is only valid for frontal faces. Tran et al. [31] made a nonlinear improvement to the traditional linear 3DMM by performing end-to-end learning in a weakly supervised manner. Lee et al. [32] proposed an uncertainty-aware encoder that effectively combines graph convolutional neural networks and generative adversarial networks. Wu et al. [33] offer a method for learning 3D deformable models from raw single-view images without external supervision. Browatzki [34] proposes a semi-supervised approach whose key idea is to generate implicit face information from many existing unlabeled photos. Feng et al. [11] used UV maps to map 3D shapes onto 2D images for representation and then constructed 3D face shapes. However, these methods do not perform well with large poses and strongly occluded faces. The method of Jackson et al. [12] bypasses the problems associated with constructing and fitting 3D deformation models by regressing the volume of the 3D face geometry from a single 2D face image. The method is no longer restricted to the model space but requires a complex network structure and much time to predict the voxel data. Recently, some works decompose a given 3D face into identity and expression parts and encode them nonlinearly to achieve 3D face reconstruction with remarkable results. However, these methods do not consider the interaction between face attributes and only estimate each part individually during parameter regression. Unlike the above methods, our approach enhances the semantically meaningful face attribute representation and directly obtains the complete 3D face geometry and its corresponding information by learning the correlation of different 3D face attribute parameters.

Proposed Method
In this section, we first introduce a 3D face model with latent representations and design our approach based on it. Then we propose a transformer-based joint learning pipeline for encoders and decoders. Finally, specific implementation details of this face attribute branching method are given, including the network structure, training data, and training procedure.

A Composite 3D Face Shape Model
The 3D morphable model is one of the most successful methods for describing 3D face reconstruction. In this work, we adopt a common practice in the 3D morphable model (3DMM) [27], representing a 3D human face as a combination of shape and expression. Each 3D face shape is represented by the concatenation of its vertex coordinates:

S = [S_1^T, S_2^T, ..., S_n^T]^T ∈ R^{3n},

where n is the number of vertices in the point cloud of the 3D face, T denotes the transpose, and S_i = (x_i, y_i, z_i) are the coordinates in the Cartesian coordinate system. In this paper, we use the 3DMM to recover the 3D geometry of a human face from a single image. The 3DMM uses a principal component analysis (PCA) model to represent the face geometry S_Model as:

S_Model = S̄ + ΔS_id + ΔE_exp = S̄ + A_id α_id + B_exp β_exp,

where S̄ ∈ R^{3n} is the mean geometry, ΔS_id is the identity-sensitive difference between S̄ and S_Model, and ΔE_exp denotes the expression-sensitive difference. A_id and α_id are the identity basis and identity parameters of the face; B_exp and β_exp are the expression basis and expression parameters. S̄ and A_id are learned from the Basel Face Model [28], and B_exp is obtained from FaceWarehouse [36]. In the process of 3D face fitting, we use a weak perspective projection to project the 3D face onto the 2D image plane:

C = f · Pr · R · S + t,

where C is the geometry projected in image coordinates, f is a scale factor, Pr = [[1, 0, 0], [0, 1, 0]] is the orthographic projection matrix, R is a rotation matrix consisting of 9 parameters, and t is a translation vector. We can then transform the 3D face reconstruction problem into a face parameter regression problem. We have 62 parameters to regress in the 3D face model, where the pose parameter is v_pose = [f, R, t], so the set of all model parameters is p = [v_pose, α_id, β_exp]^T.
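As a concrete illustration, the 3DMM composition and the weak perspective projection above can be sketched as follows. The basis matrices here are random stand-ins rather than the learned BFM/FaceWarehouse bases, and the 12/40/10 split of the 62 parameters follows the common 3DDFA-style convention, which is an assumption of this sketch.

```python
import numpy as np

# Illustrative dimensions only: n vertices, 40 identity and 10 expression
# bases (a 3DDFA-style 12 pose + 40 id + 10 exp split of the 62 parameters
# is assumed here, not stated in the text).
n, n_id, n_exp = 1000, 40, 10

rng = np.random.default_rng(0)
S_mean = rng.standard_normal(3 * n)          # mean geometry S_bar in R^{3n}
A_id = rng.standard_normal((3 * n, n_id))    # identity basis (stand-in)
B_exp = rng.standard_normal((3 * n, n_exp))  # expression basis (stand-in)

def reconstruct(alpha_id, beta_exp):
    """S_Model = S_bar + A_id @ alpha_id + B_exp @ beta_exp."""
    return S_mean + A_id @ alpha_id + B_exp @ beta_exp

def project(S, f, R, t):
    """Weak perspective projection C = f * Pr * R * S + t."""
    Pr = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])          # orthographic projection matrix
    V = S.reshape(-1, 3).T                    # 3 x n vertex matrix
    return f * (Pr @ (R @ V)) + t.reshape(2, 1)

# Zero parameters reproduce the mean face; identity pose leaves it in place.
S = reconstruct(np.zeros(n_id), np.zeros(n_exp))
C = project(S, f=1.0, R=np.eye(3), t=np.zeros(2))
```

Under this convention, regressing p = [v_pose, α_id, β_exp] amounts to predicting the inputs of `reconstruct` and `project` from the image.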
Fig. 2 Model overview. Our transformer module consists of an encoder and a decoder. In the encoder, each attribute of the face is extracted from the face image by a multi-layer convolutional neural network. This attribute information is then passed through the transformer block to achieve uniform encoding. For the encoded face feature sequence, the decoder module extracts each face attribute with a multi-head attention mechanism and outputs the face attribute information through fully connected layers.

Transformer Module
Previous approaches [23] used disentangled representations to extract individual face attributes from face images. Such an approach can simultaneously model shape, expression, and texture by decomposing the underlying representation into relevant factors such as identity and expression while preserving identity information. Therefore, we use an encoder to extract three different sets of features, i.e., identity, expression, and pose. Inspired by [37], the transformer structure can extract global information from the input features and implement information exchange among all entries, whereas the limited receptive field of convolutional neural networks can only mine local information. We introduce the transformer module inside the encoder and decoder to effectively combine global and local information. This module can obtain a global representation from the shallow layers and extract the geometric structure information of the face. In addition, it can capture more similarity of face attributes between shallow and deep representations and extract deep facial semantic information.
As shown in Fig. 2, we introduce the transformer module inside the encoder and decoder, which plays a vital role in the 3D face space. The semantic features in the image are extracted layer by layer inside the encoder in a structured way, with a multi-layer convolutional stack connected to a transformer block. Specifically, the encoder module is divided into four convolutional layers and a transformer structure. First, the input face image is convolved by a 3 × 3 convolution to obtain the initial features of the face. Then the encoder network extracts the initial face features layer by layer, gradually increasing the number of feature channels. Finally, the transformer block maps the features to a high-dimensional space, encoding the face as a 2048 × 7 × 7 feature sequence. The traditional transformer structure uses a self-attention mechanism that directly computes the attention weights at each position of the sequence during encoding and then computes the implicit vector representation of the whole sequence as a weighted sum. However, the drawback of this self-attention mechanism is that the model focuses excessively on a single attribute position of the face when encoding the information at the current position. Therefore, we use a multi-head attention mechanism to learn the different attribute behaviors of faces and then combine these behaviors with additional face attributes as knowledge. The aim is to capture the dependencies between individual face attributes within a sequence and improve the subspace representation.
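A minimal sketch of the multi-head attention computation used inside a transformer block. The reduced feature width and random weights are illustrative assumptions, not the network's actual dimensions; the 49-token sequence length mirrors the flattened 7 × 7 spatial grid described above.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (seq_len, d_model). Each head attends over the whole sequence,
    so every position exchanges information globally rather than within
    a small convolutional receptive field."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        attn = softmax(Q[:, s] @ K[:, s].T / np.sqrt(d_head))
        heads.append(attn @ V[:, s])          # (seq_len, d_head) per head
    return np.concatenate(heads, axis=1) @ Wo

# Flattening the 7 x 7 grid yields 49 tokens; the width is reduced to 64
# here (an assumption) instead of the paper's 2048 channels.
rng = np.random.default_rng(0)
d = 64
X = rng.standard_normal((49, d))
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))
Y = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads=4)
```

Splitting the width across heads lets each head specialize in one attribute subspace while the output projection recombines them, which is the behavior the text attributes to its multi-head design.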
Besides, our module preserves the face image's local features so as not to destroy the spatial structure of its appearance. In the decoder, we do not encode the image with position vectors as in the TRT-ViT [37] approach; instead, we convolve the previously passed face feature sequence with a 1 × 1 convolution and then feed the resulting feature sequence into the transformer structure. The transformer module then dimensionally transforms the input face attribute features, checks the distribution of each face attribute in the feature space, and calculates the correlation between face attribute feature points.

Attribute Decomposition Representation
Single-view 3D face reconstruction aims to obtain estimates of the shape α_id, expression β_exp, and pose parameters v_pose given an input image I. The wide variety of attribute sources in the face may lead to variability in facial reconstruction, for example the distortion of the geometric contours of the face that expressions may cause. Therefore, our goal is to obtain the representative value of each latent attribute from the face image by a function F_i given the input image I:

Ē_i = F_i(I; α_id, β_exp, v_pose),

where α_id, β_exp, and v_pose are the identity, expression, and pose parameters involved in F_i. Usually, the latent face attribute representations Ē_id, Ē_exp, and Ē_pose have a much lower dimension than the input 2D face image I and the output 3D face shape S. Previous approaches extracted shape, expression, and pose attribute information independently given the input image I, i.e., Ē_id = F_id(I; α_id), Ē_exp = F_exp(I; β_exp), and Ē_pose = F_pose(I; v_pose). However, this simple feature extraction does not consider the correlation between face attributes but only decomposes the low-dimensional 3D face among the attribute variables. We design a disentangled representation transformer network (DRTN) to solve this problem. The network structure is shown in Fig. 3, and the dependencies between face attributes are learned by regression from the identity, expression, and pose branches, respectively. The specific learning of face attributes in each branch is described as follows.
Identity branch: The identity branch aims to enhance the learning of expression and pose attributes by preserving the overall facial geometry and keeping the identity unchanged. In the identity branch, we explicitly model the individual face attribute dependencies, decomposing the joint expression and pose attributes under the condition that the identity attribute Ē_id = F_id(I; α_id) stays consistent:

Ē_id,exp = F_id,exp(β_exp; I, Ē_id) (5)
Ē_id,exp,pose = F_id,exp,pose(v_pose; I, Ē_id, Ē_id,exp) (6)
Ē_id,pose = F_id,pose(v_pose; I, Ē_id) (7)
Ē_id,pose,exp = F_id,pose,exp(β_exp; I, Ē_id, Ē_id,pose) (8)

We formulate the learning process of these parameters with three autoencoders E_id, E_exp, and E_pose. Ē_id is the identity attribute information learned from the input image I through E_id. Ē_id,exp is the expression information obtained by the E_exp encoder on the basis of the identity attribute Ē_id, and Ē_id,exp,pose is the pose information learned by the E_pose encoder on the basis of Ē_id,exp. Similarly, Ē_id,pose is the pose information obtained by the E_pose encoder on the basis of the identity attribute Ē_id, and Ē_id,pose,exp is the expression information obtained by the E_exp encoder on the basis of Ē_id,pose. F_i(•) is the learnable encoder among the autoencoders applied in different orders.
Although this method can effectively solve the problem that information cannot interact between face attributes, the parameter estimation with this sequential decomposition is somewhat scattered. To address this, we use the coupled-variable method to fuse the expression and pose attribute information under the condition of identity invariance; the fused attribute representations are:

T(α_id, β_exp, v_pose) = Ē_id ⊗ Ē_id,exp ⊗ Ē_id,exp,pose (9)
T(α_id, v_pose, β_exp) = Ē_id ⊗ Ē_id,pose ⊗ Ē_id,pose,exp (10)

where T(α_id, β_exp, v_pose) and T(α_id, v_pose, β_exp) represent the facial expression and pose features obtained after the facial attributes pass through the encoder networks in different orders based on identity consistency, and ⊗ is the element-wise Hadamard product.
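The coupled-variable fusion of Eq. (10) reduces to an element-wise product of the sequentially encoded features. A toy sketch with random stand-in vectors in place of the learned encoder outputs:

```python
import numpy as np

# Hypothetical stand-ins for the identity-branch encoder outputs; in the
# network each is produced by a learned encoder (E_id, E_pose, E_exp)
# conditioned on the previous code, as in the sequential decomposition.
d = 8
rng = np.random.default_rng(0)
E_id = rng.standard_normal(d)            # Ē_id
E_id_pose = rng.standard_normal(d)       # Ē_id,pose
E_id_pose_exp = rng.standard_normal(d)   # Ē_id,pose,exp

# Eq. (10): T(α_id, v_pose, β_exp) = Ē_id ⊗ Ē_id,pose ⊗ Ē_id,pose,exp
T = E_id * E_id_pose * E_id_pose_exp     # element-wise Hadamard product
```

The Hadamard product acts as a soft gate: a feature dimension survives in T only when all three codes agree on it, which couples the attributes rather than concatenating them.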
Expression branch: Correspondingly, the expression branch couples the identity and pose attributes to refine the reconstruction of facial details while preserving the consistency of the expression attribute Ē_exp = F_exp(I; β_exp). Its joint attribute decomposition is expressed as:

Ē_exp,id = F_exp,id(α_id; I, Ē_exp) (11)
Ē_exp,id,pose = F_exp,id,pose(v_pose; I, Ē_exp, Ē_exp,id) (12)
Ē_exp,pose = F_exp,pose(v_pose; I, Ē_exp) (13)
Ē_exp,pose,id = F_exp,pose,id(α_id; I, Ē_exp, Ē_exp,pose) (14)

where Ē_exp is the expression attribute information obtained from the input image I by the E_exp encoder. Ē_exp,id is the identity information obtained by the E_id encoder on the basis of the expression attribute Ē_exp, and Ē_exp,id,pose is the pose information obtained by the E_pose encoder on the basis of Ē_exp,id. Similarly, Ē_exp,pose is the pose information obtained by the E_pose encoder on the basis of the expression attribute Ē_exp, and Ē_exp,pose,id is the identity information obtained by the E_id encoder on the basis of Ē_exp,pose. The representations of the identity and pose attributes under the condition of constant expression are:

T(β_exp, α_id, v_pose) = Ē_exp ⊗ Ē_exp,id ⊗ Ē_exp,id,pose (15)
T(β_exp, v_pose, α_id) = Ē_exp ⊗ Ē_exp,pose ⊗ Ē_exp,pose,id (16)

where T(β_exp, α_id, v_pose) and T(β_exp, v_pose, α_id) represent the face pose and identity features obtained after the facial attributes pass through the encoder networks in different orders based on expression consistency.
Pose branch: The pose branch aims to improve the alignment of 3D face landmarks by coupling the identity and expression attributes while preserving the consistency of the pose attribute Ē_pose = F_pose(I; v_pose). Its joint attribute decomposition is expressed as:

Ē_pose,id = F_pose,id(α_id; I, Ē_pose) (17)
Ē_pose,id,exp = F_pose,id,exp(β_exp; I, Ē_pose, Ē_pose,id) (18)
Ē_pose,exp = F_pose,exp(β_exp; I, Ē_pose) (19)
Ē_pose,exp,id = F_pose,exp,id(α_id; I, Ē_pose, Ē_pose,exp) (20)

where Ē_pose is the pose attribute information obtained from the input image I by E_pose. Ē_pose,id and Ē_pose,exp are the identity and expression information obtained by the E_id and E_exp encoders on the basis of the pose attribute Ē_pose. Ē_pose,id,exp is the expression information obtained by the E_exp encoder on the basis of Ē_pose,id, and Ē_pose,exp,id is the identity information obtained by the E_id encoder on the basis of Ē_pose,exp. Therefore, the attribute representations of identity and expression under the condition of constant pose are:

T(v_pose, α_id, β_exp) = Ē_pose ⊗ Ē_pose,id ⊗ Ē_pose,id,exp (21)
T(v_pose, β_exp, α_id) = Ē_pose ⊗ Ē_pose,exp ⊗ Ē_pose,exp,id (22)

where T(v_pose, α_id, β_exp) and T(v_pose, β_exp, α_id) represent the facial expression and identity features obtained after the facial attributes pass through the encoder networks in different orders based on pose consistency.
Fusion module: To reconstruct a more realistic and complete 3D face, we extract the face information related to each attribute from the identity, expression, and pose branch networks and use the fusion module to merge the face attributes to complete an accurate 3D face reconstruction. This module ensures the validity of the face attribute decomposition. In addition, the face features from the shallow layers of the network usually have higher resolution and contain clearer information, so we introduce each branch's shallow face attribute information in the fusion module to better refine the face details. The face attribute features of the fusion module are represented as:

G_i = R(T(α_id, β_exp, v_pose), T(α_id, v_pose, β_exp), T(β_exp, α_id, v_pose), T(β_exp, v_pose, α_id), T(v_pose, α_id, β_exp), T(v_pose, β_exp, α_id)),

where T_i(•) denotes the feature information after the fusion of face attributes, R(•) denotes the fusion of the identity, expression, and pose attributes of the face, and G_i(•) is the feature output after fusion.

Objective Loss Function
In the face attribute branching network, we introduce four learning objectives in model training. In learning the face attribute parameters, we mainly constrain three parts: identity, expression, and pose. To improve the network's learning of identity attributes, we optimize the parameters α_id by minimizing the vertex distance between the predicted face geometry and the ground-truth 3D face:

L_id = ∥A_id α_id − A_id ᾱ_id∥²₂,

where α_id denotes the predicted face identity parameter and ᾱ_id is the ground-truth parameter. A_id is the identity basis of the 3DMM PCA. In addition, different α_id dimensions and singular values affect the face geometry differently. Therefore, in the identity branch, the L_id constraint on identity information can reduce the influence of dominant dimensions, especially those with large singular values. Similarly, to refine the expression details of the face, we use an expression consistency loss to enhance expression preservation:

L_exp = ∥B_exp β_exp − B_exp β̄_exp∥²₂,

where β_exp denotes the predicted face expression parameter and β̄_exp is the ground-truth expression parameter. B_exp is the expression basis of the 3DMM PCA. Since the pose of a face has limited degrees of freedom, to simplify the computation we constrain the pose estimation by the pose parameter loss:

L_pose = ∥v_pose − v̄_pose∥²₂,

where v̄_pose is the ground truth of the pose parameter. Thus, we further improve the face landmark alignment accuracy under large-pose conditions by constraining the pose parameters. Although the L_id, L_exp, and L_pose losses strongly constrain the identity, expression, and pose attributes of the three branches, the reconstructed 3D faces still lack constraints on the geometric contours. Therefore, we use a sparse 2D face landmark constraint to improve the reconstructed geometric contour information:

L_68 = (1/68) Σ_{k=1}^{68} ∥l_k − l̄_k∥²₂,

where l_k and l̄_k are the predicted and ground-truth 2D landmarks. The final loss function L is defined as:

L = λ_id L_id + λ_exp L_exp + λ_pose L_pose + λ_68 L_68,

where λ_id, λ_exp, λ_pose, and λ_68 are the weights that balance these constraints.
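The four constraints and their weighted combination can be sketched as below. The squared-L2 form of each term and the λ_68 value are assumptions of this sketch (the text gives the weights for the other three terms but not the exact norms), so this is an illustration of the structure of the objective, not the paper's exact implementation.

```python
import numpy as np

def total_loss(alpha, alpha_gt, beta, beta_gt, pose, pose_gt,
               lm, lm_gt, A_id, B_exp,
               lam_id=1.0e-3, lam_exp=1.0e-4, lam_pose=1.0e-4, lam_68=1.0):
    """Weighted sum of the four constraints: identity vertex distance,
    expression vertex distance, pose parameter distance, and sparse 2D
    landmark distance. Squared-L2 penalties and lam_68 are assumptions."""
    L_id = np.sum((A_id @ (alpha - alpha_gt)) ** 2)    # vertex distance
    L_exp = np.sum((B_exp @ (beta - beta_gt)) ** 2)
    L_pose = np.sum((pose - pose_gt) ** 2)
    L_68 = np.mean(np.sum((lm - lm_gt) ** 2, axis=1))  # 68 sparse landmarks
    return lam_id * L_id + lam_exp * L_exp + lam_pose * L_pose + lam_68 * L_68
```

Note that multiplying the parameter errors by the bases A_id and B_exp penalizes errors in geometry space, so parameter dimensions with large singular values are weighted more heavily, matching the motivation given for L_id above.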

Experiments
In this section, we perform several experiments and ablation studies on DRTN using different settings on three extensively evaluated datasets, 300W-LP [22], AFLW2000-3D [22], and AFLW [38], to demonstrate the effectiveness of the method in 3D face reconstruction and dense face alignment.In addition, we further evaluate the generalization performance of the DRTN method on the LS3D-W [41] and CelebA [40] datasets.

Implementation Details
We use the PyTorch [42] deep learning framework to train the DRTN model. To train our framework, the face regions are cropped according to the ground-truth 3D facial landmarks and then scaled to 256 × 256 as the input to our network. Throughout the training of the branching network, the increase in the number of layers and parameters makes neural network training slow and occasionally causes overfitting to the training set, so a deep cascade regression network may not learn any information from the overfitted samples. To better balance the learning weights of each branch parameter, the loss weights λ_id = 1.0e-3, λ_exp = 1.0e-4, and λ_pose = 1.0e-4 are set for training after several experiments. In addition, both training and testing experiments were conducted on a PC with an NVIDIA GeForce GPU and CUDA 11.2. The SGD solver's mini-batch size and initial learning rate were set to 128 and 0.03, respectively. Our training set contains 687,854 face images, of which 122,450 are natural face images and 565,404 are synthetic face images. The natural face images are from the 300W-LP [22] dataset, which is extended using various data augmentation algorithms. We trained for a total of 70 epochs; after epochs 30, 40, and 60, we reduced the learning rate to 0.02, 0.004, and 0.0008, respectively.
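The step learning-rate schedule described above can be written as a small helper, assuming the reduction points in the text refer to training epochs:

```python
def learning_rate(epoch):
    """Step schedule from the text: initial rate 0.03, reduced to
    0.02, 0.004, and 0.0008 after epochs 30, 40, and 60 (70 epochs total)."""
    if epoch < 30:
        return 0.03
    if epoch < 40:
        return 0.02
    if epoch < 60:
        return 0.004
    return 0.0008
```

In PyTorch this would typically be handled by a multi-step scheduler with milestones at 30, 40, and 60.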
AFLW2000-3D: AFLW2000-3D [22] is constructed to evaluate 3D face alignment on challenging unconstrained images. This dataset contains the first 2000 images from AFLW and expands its annotations with fitted 3DMM parameters and 68 3D landmarks. We use this dataset to evaluate the performance of our method on the face alignment and face reconstruction tasks.
AFLW: AFLW [38] is a large-scale face dataset, including multiple poses and views, generally used to evaluate the effectiveness of facial landmark detection. The dataset has 25,993 face images with 21 landmarks annotated for each face, whereas landmarks are not annotated for faces in invisible regions. In addition, the dataset includes face pose angle annotations obtained from average 3D face reconstruction. Most of the face images in the AFLW dataset are color images and a few are grayscale; 59% of the faces are female and 41% male. This dataset is well suited for multi-angle, multi-face detection, landmark localization, and head pose estimation, and is an essential dataset in the field of face landmark alignment.
LS3D-W: LS3D-W [41] is a large-scale face alignment annotation dataset created by the computer vision laboratory at the University of Nottingham. The face images are from AFLW [38], 300-VW [47], 300-W [46], and FDDB [48]. Each face image contains 68 annotated landmarks, and the dataset contains a total of approximately 230,000 accurately labeled face images.
CelebA: CelebA [40] is a large-scale face attribute dataset containing over 200K images, each with 40 attribute annotations. The images cover an extensive range of pose variations and complex backgrounds. The CelebA dataset includes 10,177 identities and 202,599 face images.

Evaluation
In terms of face alignment and reconstruction, we use the NME_2d and NME_3d metrics to quantitatively evaluate performance:

NME_{2d} = \frac{1}{N_k} \sum_{j=1}^{N_k} \frac{1}{L\,D_j} \sum_{i=1}^{L} \left\lVert \hat{V}^{2d}_i - V^{2d}_i \right\rVert_2, \qquad NME_{3d} = \frac{1}{N_k} \sum_{j=1}^{N_k} \frac{1}{L\,S_j} \sum_{i=1}^{L} \left\lVert \hat{V}^{3d}_i - V^{3d}_i \right\rVert_2,

where \hat{V}^{2d}_i and V^{2d}_i denote the estimated 2D landmarks and the ground-truth landmarks, \hat{V}^{3d}_i and V^{3d}_i denote the estimated and ground-truth 3D vertices, L is the number of landmarks (or vertices) per image, and N_k is the number of test images. D_j and S_j are the diagonal sizes of the face region of image j in the image plane and in 3D coordinate space, respectively. NME_2d evaluates the normalized 2D facial landmark prediction error, and NME_3d evaluates the normalized 3D face geometry estimation accuracy. Due to the ambiguity of the weak-perspective projection model, the reconstruction results of different methods have some ambiguity along the z-axis; we therefore use a rigid translation along the z-axis to align each result with the ground truth.
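The per-image errors defined above can be sketched with NumPy as follows. This is a minimal sketch, not the paper's evaluation code; the z-axis alignment implements the rigid z-translation mentioned in the text as a mean-offset correction, which is one common choice.

```python
import numpy as np

def nme_2d(pred, gt, bbox_diag):
    """Normalized mean error for one image.
    pred, gt: (L, 2) arrays of estimated / ground-truth 2D landmarks;
    bbox_diag: diagonal size D_j of the face region in the image."""
    return np.mean(np.linalg.norm(pred - gt, axis=1)) / bbox_diag

def nme_3d(pred, gt, diag_3d):
    """3D normalized mean error for one image, after aligning the
    prediction to the ground truth by a rigid translation along z
    (weak-perspective depth ambiguity).
    pred, gt: (L, 3) vertex arrays; diag_3d: diagonal size S_j in 3D."""
    pred = pred.copy()
    pred[:, 2] -= np.mean(pred[:, 2] - gt[:, 2])  # rigid z-translation
    return np.mean(np.linalg.norm(pred - gt, axis=1)) / diag_3d
```

Averaging these per-image values over the N_k test images yields the NME_2d and NME_3d reported in the tables.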

Analysis of 3D Face Alignment Results
We quantitatively evaluate face landmark alignment performance using the normalized mean error NME_2d (%) on the AFLW2000-3D [22] and AFLW [38] datasets. We divide each test set into three subsets based on the absolute yaw angle: [0°, 30°], [30°, 60°], and [60°, 90°]. The results are reported in Table 1, with the best result in each category highlighted in bold; lower values are better. Fig. 4 shows the corresponding CED curves; our DRTN is compared only with the methods in Table 1 for which code is available, since the other methods have no relevant open-source implementations. Compared to the benchmark methods [13,14,22,53,55,56], the DRTN method achieves a lower normalized mean error on the AFLW [38] and AFLW2000-3D [22] datasets. The experimental results show that DRTN significantly improves 3D face landmark alignment accuracy over the full pose range and remains robust under large poses. Fig. 5 shows our method's 3D face landmark alignment results on the AFLW [38] and AFLW2000-3D [22] datasets. The advantage of using a 3DMM instead of other geometry representations is that semantic facial landmarks can be associated with the corresponding points of the reconstructed geometry. The visualization results show that DRTN significantly outperforms the 3DDFA [22], DAMDNet [13], GSRN [55], MARN [14], and MFIRRN [56] methods for 3D face landmark alignment under large pose, extreme expression, and occlusion conditions, especially around the eyes, mouth, and face contours. These qualitative results show that our DRTN method significantly improves the network's learning of face attribute features in complex environments and thus improves face landmark alignment accuracy.
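The yaw-binned evaluation and the CED curves above can be sketched as follows. This is an illustrative sketch, assuming per-image NME values and absolute yaw angles are already available; the bin edges are the three pose subsets used in the paper.

```python
import numpy as np

def ced_curve(errors, thresholds):
    """Cumulative error distribution: the fraction of test images whose
    NME is at most each threshold (the curves in Figs. 4, 6, and 7)."""
    errors = np.asarray(errors, dtype=float)
    return np.array([(errors <= t).mean() for t in thresholds])

def mean_nme_by_yaw(errors, yaws):
    """Mean NME per absolute-yaw subset: [0,30], (30,60], (60,90]."""
    bins = {"[0,30]": [], "[30,60]": [], "[60,90]": []}
    for e, y in zip(errors, yaws):
        a = abs(y)
        if a <= 30:
            bins["[0,30]"].append(e)
        elif a <= 60:
            bins["[30,60]"].append(e)
        else:
            bins["[60,90]"].append(e)
    return {k: float(np.mean(v)) if v else float("nan")
            for k, v in bins.items()}
```

Plotting `ced_curve` over a dense grid of thresholds reproduces the style of comparison used against the open-source baselines.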

Analysis of 3D face reconstruction results
The AFLW dataset is unsuitable for evaluating 3D face reconstruction because recovering 3D faces from the annotated visible landmarks usually leads to ambiguity. To validate the effectiveness of our model in 3D face reconstruction, we compare the 3D normalized mean error NME_3d (%) on the AFLW2000-3D dataset, as shown in Table 2. The best result in each category is highlighted in bold; lower is better. The results in Table 2 show that our DRTN method outperforms the state-of-the-art methods [13,14,22,55,56] for face reconstruction in both medium and large poses.
Fig. 6 The cumulative error distribution (CED) curves on AFLW2000-3D.
Fig. 7 The CED curves for small, medium, and large poses on AFLW2000-3D.
Compared to the recent MARN [14], the NME_3d (%) of our DRTN at yaw angles [0°, 30°], [30°, 60°], and [60°, 90°] is reduced by 3.2%, 3.9%, and 4.9%, respectively. The improvement is largest in the large yaw range [60°, 90°], indicating that the model can reconstruct 3D faces accurately even under unconstrained large-pose conditions. Compared with GSRN [55], our DRTN reduces NME_3d (%) by 1.0% and 6.6% for the medium pose [30°, 60°] and large pose [60°, 90°], respectively. This further validates the superior 3D reconstruction capability of our DRTN model in the presence of complex occlusions. Fig. 6 shows the corresponding CED curves for 3D face reconstruction, which intuitively demonstrate that DRTN achieves accurate 3D face reconstruction. As shown in Figs. 8 and 9, the DRTN method reconstructs more geometrically accurate facial geometry than the current state-of-the-art methods, especially in the local details of the eyes, mouth, and wrinkles. The 3DDFA [22], DAMDNet [13], MARN [14], and MFIRRN [56] methods, lacking face attribute correlation learning, do not reconstruct faces with fine expression details. Therefore, our DRTN approach offers a significant advantage for high-fidelity 3D facial reconstruction under large poses, occlusion, and extreme expressions.

Ablation Experiments
To verify the effectiveness of each attribute branch module in DRTN, Table 3 presents ablation experiments on the AFLW-2000 dataset, where NME_2d and NME_3d are the normalized mean errors of face landmark alignment and reconstruction, respectively. Comparing the first to fourth rows of Table 3, the networks with the face pose, expression, and identity attribute branches have lower reconstruction and alignment errors than the baseline model without any face attribute branch. These results show that the disentangled representation of face attributes improves the model's attribute learning ability and robustness. In addition, the second to fourth rows of Table 3 show that, among the single-branch face attribute networks, the identity attribute model outperforms the expression and pose attribute models, because the training labels contain more identity attribute parameters than expression and pose parameters. Thus, when a label carries more annotation information, a single-branch face attribute network learns the attribute parameters more easily. Similarly, the fifth to seventh rows of Table 3 show that the network models using two face attribute branches achieve lower face alignment and reconstruction errors than the single-branch networks. These results further indicate that the disentangled representation of face attributes improves the network's learning of face attribute features. Finally, the error results in the table show that our DRTN method exhibits strong attribute learning ability in face alignment and reconstruction. The main reason is that the information between face attributes is mutually correlated rather than existing in isolation, and we can enrich the details of face reconstruction through the correlation of face attributes.
In addition, this further demonstrates the role of the face attribute branch networks in face landmark alignment and reconstruction. We visualize the face landmark alignment and reconstruction results of each branch network in Fig. 10. From the reconstructed high-fidelity 3D faces and geometry, we can see that the 3D face reconstructed by our DRTN method has a natural expression, fits the shape of the original image, and presents clear face-edge geometry and facial feature contours. We also visualize the reconstructed 3D faces of each model as heat maps, which show that the 3D faces reconstructed by our method are more realistic and have lower errors. Regarding 3D face landmark alignment, the DRTN method also achieves higher landmark alignment accuracy than the other face attribute branch models under large poses. The other face attribute branch networks lack mutual learning between attributes when learning the face attribute parameters; although they learn the independent face attributes well, their reconstruction and alignment accuracy in complex and diverse situations is limited, because the 3D face is not a linear model. In contrast, the DRTN model incorporates identity, expression, and pose information and therefore shows good stability when reconstructing unconstrained scenes.

A Visualization Experiment Performed on LS3D-W
This section compares the qualitative results of our DRTN method with the most relevant state-of-the-art methods on LS3D-W [41]. Fig. 11 shows the 3D face reconstruction and landmark alignment results in complex scenes with insufficient illumination and large face poses. As can be seen from the red dashed boxes in the figure, the DRTN model achieves higher reconstruction accuracy than the 3DDFA [22], DAMDNet [13], MARN [14], and MFIRRN [56] models for high-fidelity 3D face reconstruction, especially on the contours of the facial features, which validates that our model better captures local details and geometric contour information. LS3D-W [41] is collected in a highly unconstrained, complex environment with varying illumination and face offset angles. The blue dashed boxes in the figure show the face landmark alignment results: the DRTN model still produces good alignment under self-occluded faces and low-light conditions. These experimental results therefore verify the strong 3D reconstruction capability and landmark alignment accuracy of the face attribute disentangled representation method.

Qualitative Results on CelebA and Casual Photos
To further validate the generalization ability of our DRTN on other datasets, we used casual photos and CelebA [40] for evaluation. As shown in Fig. 12, the face images in the green dashed box are the experimental results on CelebA [40], and the face images in the red dashed box are from random everyday photos and pictures of anime characters. Our DRTN accomplishes accurate landmark alignment and fine 3D face reconstruction on both CelebA [40] and casual images, reflecting its good generalization ability. Because the DRTN method allows the network to learn the correlations between identity, expression, and pose attributes and to fuse the underlying information between face attributes, it can reconstruct accurate high-fidelity 3D faces and perform accurate 3D face landmark alignment.

Conclusion
In this paper, we propose a disentangled representation transformer network capable of recovering detailed 3D faces in an unconstrained environment and performing dense face alignment more accurately. Our DRTN method enhances the network's learning of latent face attribute information and addresses the effects of facial expression, head pose, and partial occlusion on reconstruction and landmark alignment. Quantitative results show that the proposed DRTN model is more accurate than state-of-the-art dense face alignment and 3D reconstruction methods. In addition, extensive qualitative experiments show that DRTN can successfully reconstruct high-fidelity 3D faces from 2D face images with rich details and strong generalization ability. In future work, we will further investigate the proposed method in additional 3D face reconstruction settings, such as face reconstruction from videos and cartoon character reconstruction, to further evaluate its generalization ability.

Fig. 3
Fig. 3 The pipeline of the proposed disentangled representation transformer network (DRTN). Our network consists of two parts: the decomposition part and the fusion part. The decomposition part is divided into three branches: one extracts expression and pose information conditioned on identity, another extracts identity and pose features conditioned on expression, and the last extracts identity and expression features conditioned on pose. The fusion module obtains the face information related to each attribute from the outputs of the identity, expression, and pose branch networks and merges the face attribute information to complete the 3D face reconstruction. During the learning phase, we use the Euclidean distance loss L_68 to constrain the geometry of the face.
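The decompose-and-fuse structure described in the caption can be sketched as the following forward pass. This is a minimal structural sketch, not the paper's implementation: the encoder, layer widths, and the parameter counts `n_id`/`n_exp`/`n_pose` are all illustrative assumptions.

```python
import torch
import torch.nn as nn

FEAT = 64  # assumed feature width for the sketch

class DRTNSketch(nn.Module):
    """Structural sketch of DRTN: a shared encoder, three attribute
    branches (identity, expression, pose), and a fusion module that
    merges the branch features into 3DMM-style parameter groups."""

    def __init__(self, n_id=40, n_exp=10, n_pose=7):
        super().__init__()
        self.backbone = nn.Linear(3 * 256, FEAT)   # stand-in encoder
        self.id_branch = nn.Linear(FEAT, FEAT)     # identity features
        self.exp_branch = nn.Linear(FEAT, FEAT)    # expression features
        self.pose_branch = nn.Linear(FEAT, FEAT)   # pose features
        self.fusion = nn.Linear(3 * FEAT, n_id + n_exp + n_pose)
        self.sizes = (n_id, n_exp, n_pose)

    def forward(self, x):
        h = torch.relu(self.backbone(x))
        # Fuse the three branch outputs so each attribute can draw on
        # information correlated with the others.
        f = torch.cat([self.id_branch(h),
                       self.exp_branch(h),
                       self.pose_branch(h)], dim=-1)
        p = self.fusion(f)
        n_id, n_exp, n_pose = self.sizes
        return p[..., :n_id], p[..., n_id:n_id + n_exp], p[..., -n_pose:]
```

The key design point the sketch illustrates is that the fusion layer sees all three branch outputs jointly, rather than regressing each attribute group independently.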

Fig. 8
Fig. 8 Comparison of qualitative results of 3D face reconstruction on AFLW2000-3D. Images in the first column are the source images. As can be seen from the image textures, the 3D face texture details reconstructed by our DRTN model are more natural and detailed.

Fig. 9
Fig. 9 Qualitative results of 3D face reconstruction on AFLW2000-3D.Best viewed on screen with zooming.

Fig. 10
Fig. 10 Comparison of different network branching structures on the AFLW-2000 dataset.(a) and (b) are reconstructed high-fidelity 3D faces and facial geometries.(c) and (d) are error maps and face landmark alignment results.

Fig. 11
Fig. 11 Comparison of models of different methods on the LS3D-W dataset.In the red dashed box is the high-fidelity 3D face.The blue dashed box shows the 3D landmark alignment results.

Fig. 12
Fig. 12 Qualitative results on casual photos and CelebA. (a) and (b) are the original images and the 3D landmark alignment results of the input. (c) and (d) are the high-fidelity 3D faces and the facial geometries.

Table 3
Ablation study.The NME(%) of face alignment and reconstruction results on AFLW2000-3D for different network branching structures.