1 Introduction

3D face reconstruction is an essential topic in computer vision and graphics. The human face is one of the most discriminative parts of the human body, and its features and contours carry rich attribute and semantic information. Single-view 3D face reconstruction aims to recover the complete face geometry from a given single-view image and plays a significant role in many visual analysis applications, such as face recognition [39], face verification [22], facial expression analysis [4], and facial animation [11].

However, many face images are affected by the shooting angle and the surrounding environment: partial occlusion and facial blurring distort the reconstructed 3D faces. Therefore, accurately performing 3D face reconstruction and facial landmark alignment under large face poses, partial occlusion, and insufficient illumination is a problem that must be solved. Conventional methods are mainly based on optimization algorithms, e.g., iterative closest point [2], shape from shading [30, 35], and photometric stereo [10]. Nevertheless, these facial reconstruction techniques suffer from locally optimal solutions, poor model initialization, and high optimization complexity, making the 3D face reconstruction process complex and costly.

With the rise of deep learning, CNN regression-based methods [28, 46] have emerged and demonstrated remarkable success in 3D face reconstruction and dense alignment. Nevertheless, it is very challenging to reconstruct 3D face shapes from 2D images without prior knowledge because 2D images do not convey explicit depth information. A common approach to the single-view 3D face reconstruction problem is to use a set of 3D base shapes to capture the subspace or construct a morphable model of face shape variation. For instance, Feng et al. [21] proposed a simple convolutional neural network for 3D face reconstruction and dense alignment that regresses the UV position map from a single 2D image and records the complete 3D face shape in UV space. This method does not rely on any a priori face model and can reconstruct the entire face geometry based on the semantic information of the face. Currently, training deep neural networks typically requires a large amount of data, while real 3D facial shape labels are scarce. To address this problem, Deng et al. [18] introduced a method for deep 3D face reconstruction that employs a robust hybrid loss function for weakly supervised learning and considers supervision from both low-level and perceptual layers in the loss design. Additionally, this method leverages complementary information from different images for shape aggregation, enabling multi-image face reconstruction. While this method achieves fast, accurate, and robust 3D face reconstruction under occlusion and large pose conditions, it has limitations in reconstructing 3D faces with yaw angles in the range \(\left[ 60^{\circ }, 90^{\circ }\right] \). Yao et al. [20] developed a method that robustly generates UV displacement maps by extracting personalized detail parameters and generic expression parameters from a low-dimensional latent representation. They also use a regressor to predict detail, shape, color, expression, pose, and illumination parameters from a single image, enabling the learning of animatable displacement models from in-the-wild images. While this method produces rich details in the reconstructed 3D faces, the network does not learn the interrelation among facial attributes such as shape, expression, and pose during training, resulting in artifacts in texture reconstruction under large pose conditions.

Yang et al. [51] introduced a large-scale detailed 3D facial dataset called FaceScape and proposed a novel algorithm to predict fine-grained and manipulatable 3D facial models from a single input image. Although this method is robust in reconstructing details from frontal facial photos, it performs poorly under self-occlusion and large pose conditions, making it unsuitable for face reconstruction in complex environments. Jackson et al. [23] pointed out that existing 3D face reconstruction methods are affected by the pose, expression, and illumination of the input face photos, which makes fitting the face model complex and inefficient. Hence, their method bypasses the construction of a 3D deformation model and regresses the 3D facial geometry directly from a single 2D facial image. However, existing methods cannot explicitly capture information about individual face attributes, which is useful for single-view 3D face reconstruction and alignment. To address this challenge, the works [26, 38] suggested encoder–decoder networks that separate the identity and expression features in the 3D face reconstruction process from a single 2D image and encode them nonlinearly to accomplish accurate 3D face reconstruction. While encouraging performance has been obtained, these methods regress each set of parameters independently and cannot explicitly exploit the complementary information between face attributes.

Fig. 1

Conventional 3D face reconstruction and our DRTN framework. The blue dashed box encloses the regression of face attribute parameters with the traditional method. The red dashed box shows the DRTN method, which correlates face attribute features to reinforce the network's learning of the attribute parameters

This paper proposes a disentangled representation learning method for single-view 3D face reconstruction and alignment. Current approaches [34] mainly decompose the face attributes and estimate the shape, expression, and pose parameters individually. Although this strategy enhances the learning of a single face attribute, these methods do not consider the interaction between features. Hence, we extract shape, expression, and pose parameters from 2D images and fuse the facial parameters to capture the dependencies between the attributes, which is useful for 3D face reconstruction and alignment under large poses, self-occlusion, and poor lighting conditions.

Our approach aims to obtain complementary feature information by decomposing the face attribute information to enhance the correlation between individual attributes. Hence, we carefully designed a disentangled representation transformer network (DRTN), which includes identity, expression, and pose branches. The identity branch aims to enhance the learning of the expression and pose attributes by preserving the overall face geometry and keeping the identity intact. Accordingly, the expression and pose branches maintain the consistency of the expression and pose attributes, respectively, which helps refine the reconstruction and alignment of facial details under large poses by coupling the other facial attribute parameters. The network parameters of the designed attribute branch architecture are optimized by backpropagation in an end-to-end manner. Figure 1 illustrates the proposed DRTN pipeline. Experimental results demonstrate that DRTN significantly outperforms the traditional independent regression of attribute parameters in 3D face reconstruction and landmark alignment and exhibits very competitive performance on the test datasets. The contributions of our work are summarized as follows:

  1. We develop a decomposed representation learning method for faces that models the correlation between face attributes. The proposed method reduces the ambiguity of facial attribute learning in traditional CNN-based parameter regression and enhances the learning of face attribute information, enabling semantic editing of identity, expression, and pose.

  2. We propose a novel disentangled representation transformer network for single-view 3D face reconstruction and alignment. Our framework regresses the representation through face attribute branches, derived independently from the identity, expression, and pose components. Based on identity attribute consistency, the identity branch introduces pose and expression attribute information to enhance the integrity of the geometric facial profile. Accordingly, the expression and pose branches refine the expression and landmark alignment of the 3D face by coupling the other attribute parameters while maintaining the consistency of each attribute.

  3. To further improve performance, our transformer network addresses the problem of missing details in the face's geometric contours. The capacity of the deep model is effectively controlled, and complementary information is extracted from the multi-attribute face input. This aims to establish similarities between shallow and deep face representations and to mine local attribute information of faces through global information interaction.

2 Related works

This section briefly reviews the existing 3D facial alignment methods and reconstruction techniques.

2.1 Face alignment

Face alignment is a long-standing and widely discussed problem in computer vision. Early 2D face alignment methods fit the face shape of a given input image by constructing an overall fitting template, aiming to locate a set of baseline 2D facial landmarks. Representative methods of this type include active shape models (ASM) [15], active appearance models (AAM) [16], and constrained local models (CLM) [14]. However, because these methods rely on multiple cascaded classifier regressions to locate the 2D facial landmarks, they struggle to describe the face shape quickly and accurately when the pose is large.

Recent approaches use deep learning to study 3D face alignment under large poses, with typical solutions either directly regressing the face parameters, bypassing the 3D deformable model, or fitting a dense 3DMM with cascaded CNN regression. For example, Jourabloo et al. [29] designed a cascaded coupled regression method to estimate the camera projection matrix and 3D face landmarks by integrating a 3D point distribution model. Bulat et al. [8] proposed a heat map regression-based method to estimate 3D face landmarks. Each landmark corresponds to a heat map, and these heat maps are regressed from the input RGB images by learning the 3D face depth values through a residual network. Besides, a video-based 3D cascade regression method has been developed by Jeni et al. [25]. This method generates a dense 3D shape in real time from an input 2D face image. The algorithm estimates the position of a dense set of markers and their visibility and then achieves dense 3D face alignment by fitting a partial 3D model. Ning et al. [40] proposed a real-time 3D face alignment method that uses an encoder–decoder network with efficient convolutional layers to enhance information transfer between different resolutions in the encoding and decoding stages, achieving advanced performance.

Several works have performed face alignment by fitting 3D morphable models (3DMM) to 2D facial images. For instance, [56] fits a 3DMM by iteratively applying a single CNN, where the CNN augments the input channels with the represented shape features in each iteration. Liu et al. [38] use multi-constraint training of CNNs to estimate 3DMM parameters and then provide very dense 3D alignment. Jiang et al. [27] proposed a dual-attention mechanism and a practical end-to-end 3D face alignment framework. This work constructs a stable network model using depthwise separable convolution, densely connected convolution, and a light channel attention mechanism. This model-based 3D reconstruction method can easily accomplish 3D face alignment. However, these methods rely on deep networks to directly learn the relationship between 2D input images and 3D models and do not consider the integrity of the correspondence between face attributes. To address this issue, our model decomposes face attributes within a unified deep learning architecture and automatically extracts useful feature information directly from pixels by exploiting the complementary information between face attributes. This approach infers face contours in invisible regions and improves face alignment under large poses.

2.2 3D face reconstruction

Initially, 3D face reconstruction was mainly used in medical applications for human head diagnosis. Facial reconstruction was primarily conducted by scanning the face with a 3D scanner [17, 52] to obtain the face's shape, structure, and texture information and reconstruct a full 3D face. Although reconstructing 3D faces with scanners is accurate, the whole implementation process is complicated and time-consuming. Therefore, Blanz et al. [5] proposed the 3D morphable model (3DMM), which is built from a 3D face database using face shape and texture statistics as constraints. The influence of pose and illumination on the facial reconstruction process is considered to generate accurate 3D face models. Later, Paysan et al. [41] introduced the Basel Face Model (BFM), based on the original 3DMM and involving improved alignment algorithms for higher shape and texture accuracy. However, these methods establish the correspondence of the vertices between the input image and the 3D template and then solve a nonlinear optimization function to regress the 3DMM coefficients. Therefore, they rely heavily on the accuracy of landmarks or other feature point detectors.

Due to the wide application of deep learning in various fields, many works have recently utilized CNNs to predict face parameters for reconstruction. For instance, Elad et al. [42] introduced an end-to-end convolutional neural network framework that generates the geometry of a face in a coarse-to-fine manner. Subsequently, Dou et al. [19] proposed a deep neural network-based approach to improve facial expression reconstruction by integrating multi-task loss functions and fused convolutional neural networks into a DNN structure. This approach avoids the complex 3D rendering process, but the reconstruction is only valid for frontal faces. Besides, Tran et al. [45] made a nonlinear improvement to the traditional linear 3DMM by performing end-to-end learning in a weakly supervised manner. Lee et al. [32] used an uncertainty-aware encoder that effectively combines graph convolutional neural networks and generative adversarial networks. Wu et al. [48] offered a process for learning 3D deformation models from original single-view images without external supervision. Additionally, Browatzki [6] developed a semi-supervised approach, where the key idea is to generate implicit face information from many existing unlabeled photos. Feng et al. [21] used UV maps to map 3D shapes to 2D images for representation and then constructed 3D face shapes. However, these methods perform poorly with large poses and strongly occluded faces.

The method of Jackson et al. [23] bypasses the problems associated with constructing and fitting 3D deformation models by directly regressing a volumetric representation of the 3D face geometry from a single 2D face image. The method is no longer restricted to the model space but requires a complex network structure and much time to predict the voxel data. Recently, some works decompose a given 3D face into identity and expression parts and encode them nonlinearly to achieve 3D face reconstruction with remarkable results. However, these methods do not consider the interaction between face attributes and only estimate each attribute individually during parameter regression. Unlike the above methods, the proposed approach enhances the semantically meaningful face attribute representation. Specifically, it directly obtains the complete 3D face geometry and its corresponding information by learning the correlation of the different 3D face attribute parameters.

3 Proposed method

This section first introduces a 3D face model with latent representations and the proposed approach based on this model. We then present a transformer-based joint learning pipeline for the encoders and decoders. Finally, specific implementation details of the face attribute branching method are given, including the network structure, training data, and training procedure.

3.1 A composite 3D face shape model

The 3D morphable model is one of the most successful methods for describing 3D face reconstruction. In this work, we adopt a common practice in the 3D morphable model (3DMM) [5] and represent a 3D human face as a combination of shapes and expressions. Specifically, the concatenated vertex coordinates representing each 3D face shape are mathematically defined as:

$$\begin{aligned} S=\left[ x_1, y_1, z_1, x_2, y_2, z_2, \dots , x_n, y_n, z_n\right] ^T \end{aligned}$$
(1)

where n is the number of vertices in the 3D face point cloud and T denotes the transpose. \(S_i=\left( x_i, y_i, z_i\right) \) are the Cartesian coordinates of the i-th vertex. This paper employs the 3DMM [5] to recover the 3D geometry of a human face from a single image. The principal component analysis (PCA) model of the 3DMM representing the face geometry \(S_{\text{ Model }}\) is defined as:

$$\begin{aligned} S_{\text{ Model } }&={\bar{S}}+\Delta S_{i d}+\Delta E_{\exp } \\&={\bar{S}}+A_{i d} \alpha _{i d}+B_{\text {exp}} \beta _{\exp } \end{aligned}$$
(2)

where \({\bar{S}} \in R^{3 n}\) is the mean geometry. \({\Delta S_{i d}}\) is the identity-sensitive difference between \({{\bar{S}}}\) and \(S_{\text{ Model } }\), and \({\Delta E_{\exp }}\) denotes the expression-sensitive difference. \({A_{i d}}\) and \({\alpha _{i d}}\) are the identity base and identity parameters of the face. \({B_{\text {exp}}}\) and \({\beta _{\exp }}\) are the expression base and expression parameters of the face. \({\bar{S}}\) and \({A_{i d}}\) are learned from the Basel Face Model [41], and \({B_{\text {exp}}}\) is obtained from FaceWarehouse [12]. During 3D face fitting, we use a weak perspective projection to project the 3D face onto the 2D face plane. This process is denoted as follows:

$$\begin{aligned} C_{(p)}=f * P_r * R * S+t \end{aligned}$$
(3)

where C is the geometry projected into image coordinates, f is a scale factor, \(P_r=\left( \begin{array}{lll}1 &{} 0 &{} 0 \\ 0 &{} 1 &{} 0\end{array}\right) \) is the orthographic projection matrix, R is a rotation matrix comprising nine parameters, and t is a translation vector. The 3D face reconstruction problem is thus transformed into a face parameter regression problem with 62 parameters to regress for the 3D face model, where the pose parameter is \(v_{\text {pose }}=[f, R, t]\). Thus, the set of all model parameters is \(p=\left[ v_{\text {pose }}, {\alpha }_{i d}, {\beta }_{\text {exp }}\right] ^T\).
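
To make the parameterization concrete, the following minimal Python sketch assembles the 3DMM shape of Eq. (2) and applies the weak perspective projection of Eq. (3). The basis matrices, vertex count, and the 62-parameter split assumed here (12 pose + 40 identity + 10 expression) are placeholders, not the actual BFM/FaceWarehouse data.

```python
import numpy as np

# Hypothetical dimensions: n vertices, 40 identity and 10 expression bases
# (an assumed split consistent with the 62-parameter model).
n, n_id, n_exp = 1000, 40, 10
S_mean = np.zeros(3 * n)                # mean geometry (placeholder for the BFM mean)
A_id = np.random.randn(3 * n, n_id)     # identity basis (placeholder for BFM)
B_exp = np.random.randn(3 * n, n_exp)   # expression basis (placeholder for FaceWarehouse)

def reconstruct_shape(alpha_id, beta_exp):
    """Eq. (2): S_model = S_mean + A_id @ alpha_id + B_exp @ beta_exp."""
    S = S_mean + A_id @ alpha_id + B_exp @ beta_exp
    return S.reshape(-1, 3).T           # 3 x n vertex matrix

def weak_perspective(S, f, R, t):
    """Eq. (3): project the 3D shape onto the 2D image plane."""
    P_r = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])   # orthographic projection matrix
    return f * P_r @ (R @ S) + t.reshape(2, 1)

S_model = reconstruct_shape(np.zeros(n_id), np.zeros(n_exp))
C = weak_perspective(S_model, f=1e-3, R=np.eye(3), t=np.zeros(2))
print(C.shape)                          # (2, n)
```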

Fig. 2

Model overview. The proposed transformer module comprises an encoder and a decoder. In the encoder, the attribute information of the face is extracted from the face image by a multi-layer convolutional neural network and then passed through the transformer block to achieve a uniform encoding of the attribute information. For the encoded face feature sequence, the decoder module extracts each face attribute with a multi-head attention mechanism and outputs the face attribute information through fully connected layers

3.2 Transformer module

Previous approaches [38] used a disentangled representation to extract individual face attributes from face images. However, this strategy models shape, expression, and texture simultaneously by decomposing the underlying representation into relevant factors, such as identity and expression, while preserving the identity information. Therefore, we use an encoder to extract three different sets of features, i.e., identity, expression, and pose. Inspired by [49], the transformer structure can extract global information from the input features and implement information exchange within each entry. In contrast, the limited receptive field of convolutional neural networks can only mine local information. Thus, we introduce the transformer module inside the encoder and decoder to effectively combine global and local information. This module obtains the global representation from the shallow layers and extracts the geometric structure information of the face. Additionally, it enhances the similarity of face attributes between shallow and deep representations and extracts deep face semantic information.

As illustrated in Fig. 2, we introduce the transformer module inside the encoder and decoder, where it plays a vital role in the 3D face space. Inside the encoder, the semantic features in the image are extracted layer by layer in a structured way using a multi-layer convolutional network connected to a transformer block. Specifically, the encoder module is divided into four convolutional layers and a transformer structure. First, the input face image is passed through a \(3 \times 3\) convolution to obtain the initial face features. The encoder network then extracts these features layer by layer, gradually increasing the number of feature channels. Finally, the features are mapped by the transformer block into a high-dimensional space, where the face is encoded as a \(2048\times 7\times 7\) feature sequence. In the transformer encoder, we stack the encoder blocks L times, with L being 6. By iteratively stacking the encoder blocks, the network can effectively capture long-range dependencies within the input sequence and generate more precise representations. In addition, the traditional transformer structure uses a self-attention mechanism that directly computes the attention weights at each position in the encoding process and then computes the implied vector representation of the whole sequence as a weighted sum. However, this self-attention mechanism causes the model to focus excessively on a single attribute position of the face when encoding the information at the current position. Therefore, we use a multi-head attention mechanism to learn the different attribute behaviors of faces and then combine these behaviors with additional face attributes as knowledge. The aim is to capture the dependencies between the individual face attributes within a sequence and improve the subspace representation.
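
As an illustration of the encoder-side attention described above, the following PyTorch sketch stacks L = 6 transformer blocks over the flattened \(2048\times 7\times 7\) feature sequence; the head count and MLP width are our assumptions, not values reported in the paper.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One transformer encoder block with multi-head attention (sketch).
    Head count and MLP width are assumptions; the paper stacks L = 6 blocks."""
    def __init__(self, dim=2048, heads=8, mlp_ratio=2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio),
                                 nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):                      # x: (B, N, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]          # global interaction across tokens
        x = x + self.mlp(self.norm2(x))
        return x

# A 2048 x 7 x 7 CNN feature map is flattened into a sequence of 49 tokens.
feat = torch.randn(2, 2048, 7, 7)
tokens = feat.flatten(2).transpose(1, 2)       # (B, 49, 2048)
encoder = nn.Sequential(*[EncoderBlock() for _ in range(6)])
out = encoder(tokens)
print(out.shape)                               # torch.Size([2, 49, 2048])
```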

It should be noted that our module preserves the local features of the face image, and therefore the spatial structure is maintained. In the decoder, we do not encode the image with position vectors as in TRT-ViT [49] but instead convolve the incoming face feature sequence with a \(1\times 1\) convolution. The obtained feature sequence is then input into the transformer structure. Afterward, the transformer module dimensionally transforms the input face attribute features, examines the distribution of each face attribute in the feature space, and then calculates the correlation between face attribute feature points.
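
A minimal sketch of the decoder entry point under these assumptions is given below: a \(1\times 1\) convolution reduces the encoded feature sequence, and multi-head attention over the resulting tokens computes the correlations between attribute features; the 512-channel width and 8 heads are illustrative choices, not reported values.

```python
import torch
import torch.nn as nn

# The encoded face feature sequence is first reduced by a 1x1 convolution and
# then processed by multi-head attention; no positional encoding is added,
# unlike TRT-ViT. Channel width and head count are assumptions.
proj = nn.Conv2d(2048, 512, kernel_size=1)
attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

feat = torch.randn(2, 2048, 7, 7)                 # encoded feature map
tokens = proj(feat).flatten(2).transpose(1, 2)    # (B, 49, 512) token sequence
decoded, weights = attn(tokens, tokens, tokens)   # correlations between attribute features
print(decoded.shape, weights.shape)               # (2, 49, 512) and (2, 49, 49)
```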

Fig. 3

Pipeline of the proposed disentangled representation transformer network (DRTN). DRTN comprises a decomposition part and a fusion part. The decomposition part is divided into three branches: extracting expression and pose information based on identity, extracting identity and pose features based on expression, and extracting identity and expression features based on pose. The fusion module obtains the face information related to each attribute from the outputs of the identity, expression, and pose branch networks and merges the face attribute information to complete the 3D face reconstruction. During the learning phase, we use a Euclidean distance loss \({\mathcal {L}}_{68}\) to constrain the geometry of the face

3.3 Attribute decomposition representation

Single-view 3D face reconstruction aims to obtain the estimates of shape \(\alpha _{i d}\), expression \(\beta _{\text {exp}}\), and pose parameters \(v_{\text {pose}}\) given an input image \({\varvec{I}}\). A wide variety of attribute sources in the face may lead to variability in facial reconstruction, e.g., facial expressions that distort the geometric contours of the face. Therefore, our goal is to obtain the representative values of each latent attribute from an input face image \({{\varvec{I}}}\) using function \({\mathcal {F}}\):

$$\begin{aligned} \left[ \bar{{\tau }}_{ {id }}, {\bar{\tau }}_{\text{ exp } }, {\bar{\tau }}_{ \text {pose }}\right] ={\mathcal {F}}\left( {\varvec{I}}; {\alpha }_{i d}, {\beta }_{\text {exp }}, {v}_{ \text {pose }}\right) \end{aligned}$$
(4)

where \(\alpha _{i d}\), \(\beta _{\text {exp}}\), and \(v_{\text {pose}}\) are the identity, expression, and pose parameters involved in \({\mathcal {F}}\). Usually, the latent face attribute representations \({{\bar{\tau }}_{ {id }}}\), \({{\bar{\tau }}_{\text{ exp } }}\), and \({{\bar{\tau }}_{\text{ pose } }}\) have a much lower dimension than the input 2D face image \(\varvec{{I}}\) and the output 3D face shape S. In contrast, previous approaches extracted the shape, expression, and pose attribute information independently from the input image \({{\varvec{I}}}\), i.e., \(\left[ \bar{{\tau }}_{i d}\right] ={\mathcal {F}}\left( {\varvec{I}}; {\alpha }_{i d}\right) \), \(\left[ \bar{{\tau }}_{\text {exp}}\right] ={\mathcal {F}}\left( {\varvec{I}}; {\beta }_{\text {exp}}\right) \), and \(\left[ \bar{{\tau }}_{{\text {pose}}}\right] ={\mathcal {F}}\left( {\varvec{I}}; {v}_{{\text {pose}}}\right) \).

However, this simple feature extraction strategy does not consider the correlation between face attributes and only decomposes the low-dimensional 3D face into separate face attribute variables. Therefore, we design a disentangled representation transformer network (DRTN) to solve this problem. The proposed network structure is illustrated in Fig. 3, and the dependencies between face attributes are learned by regression through the identity, expression, and pose branches. The specific learning of face attributes in each branch is described below.

3.3.1 Identity branch

The identity branch aims to enhance the learning of the expression and pose attributes by preserving the overall face geometry and keeping the identity unchanged. In this branch, we explicitly model the dependencies of the individual face attributes and formulate the learning process of these parameters using three encoders \({\mathcal {E}}_{i d}\), \({\mathcal {E}}_{\text {exp}}\), and \({\mathcal {E}}_{{\text {pose}}}\). Under the condition of a consistent latent identity representation \({\bar{\tau }}_{i d}={\mathcal {E}}_{i d}\left( {\varvec{I}}; {\alpha }_{i d}\right) \), we decompose the expression and pose attributes as follows:

$$\begin{aligned} {\bar{\tau }}_{i d, {\text {exp}}}&={\mathcal {E}}_{\text {exp}}\left( \beta _{\text {exp}}; {\varvec{I}}, {\bar{\tau }}_{i d}\right) \\ {\bar{\tau }}_{i d, {\text {exp}}, {\text {pose}}}&={\mathcal {E}}_{{\text {pose}}}\left( v_{{\text {pose}}}; {\varvec{I}}, {\bar{\tau }}_{i d}, {\bar{\tau }}_{i d, {\text {exp}}}\right) \\ {\bar{\tau }}_{i d, {\text {pose}}}&={\mathcal {E}}_{{\text {pose}}}\left( v_{{\text {pose}}}; {\varvec{I}}, {\bar{\tau }}_{i d}\right) \\ {\bar{\tau }}_{i d, {\text {pose}}, {\text {exp}}}&={\mathcal {E}}_{\text {exp}}\left( \beta _{\text {exp}}; {\varvec{I}}, {\bar{\tau }}_{i d}, {\bar{\tau }}_{i d, {\text {pose}}}\right) \end{aligned}$$
(5)

where \({\bar{\tau }}_{i d}\) is the identity attribute representation learned from the input image \({{\varvec{I}}}\) by the identity encoder \({\mathcal {E}}_{i d}\). \({\bar{\tau }}_{i d,\text {exp}}\) is the expression attribute representation obtained by the \({\mathcal {E}}_{\text {exp}}\) encoder based on \({\bar{\tau }}_{i d}\), and \({\bar{\tau }}_{ {id,{\text {exp}},{\text {pose}} }}\) is the pose attribute representation learned by the \({\mathcal {E}}_{ {{\text {pose}} }}\) encoder based on \({\bar{\tau }}_{i d,{\text {exp}} }\). Similarly, \({\bar{\tau }}_{ {id,{\text {pose}} }}\) is the pose attribute representation obtained by the \({\mathcal {E}}_{ {{\text {pose}} }}\) encoder based on the identity representation \({\bar{\tau }}_{i d}\), and \({\bar{\tau }}_{i d, {\text {pose}}, {\text {exp}}}\) is the expression attribute representation obtained by the \({\mathcal {E}}_{\text {exp}}\) encoder based on \({\bar{\tau }}_{{id,{\text {pose}} }}\). \({\mathcal {E}}(\cdot )\) denotes a learnable encoder; the encoders are applied in different orders.

Although this method effectively solves the problem that information cannot interact between face attributes, the parameter estimates obtained by this sequential decomposition are scattered. Thus, we use a coupled-variable scheme to fuse the expression and pose attributes under the condition of identity invariance and obtain the fused attribute representations:

$$\begin{aligned} \begin{aligned} {\bar{T}}_{\left( \alpha _{i d}, \beta _{\text {exp}}, v_{\text {pose}}\right) }&={\bar{\tau }}_{i d} \otimes {\bar{\tau }}_{i d, \text { exp }} \otimes {\bar{\tau }}_{i d, {\text { exp }}, {\text {pose}}} \\ {\bar{T}}_{\left( \alpha _{i d}, v_{\text {pose}}, \beta _{\text {exp}}\right) }&={\bar{\tau }}_{i d} \otimes {\bar{\tau }}_{i d, {\text {pose}}} \otimes {\bar{\tau }}_{i d, {\text {pose}}, {\text {exp}}} \end{aligned} \end{aligned}$$
(6)

where \({\bar{T}}_{\left( \alpha _{i d}, \beta _{ \text {exp }}, v_{ {{\text {pose}} }}\right) }\) and \({\bar{T}}_{\left( \alpha _{i d}, v_{ {{\text {pose}} }}, \beta _{\text {exp}}\right) }\) represent the fused expression and pose attribute representations obtained after the facial attributes pass through the encoder networks in different orders under identity consistency. The symbol \({\otimes }\) denotes the Hadamard (element-wise) product.
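
The following sketch illustrates one ordering of the identity branch (Eq. (5), id → exp → pose) and the coupling of Eq. (6); the encoders are simplified to small MLPs over a pooled image feature, and all dimensions are assumptions. The second ordering (id → pose → exp) and the expression and pose branches are built analogously with their own encoder instances.

```python
import torch
import torch.nn as nn

class ConditionalEncoder(nn.Module):
    """Simplified stand-in for the E_id / E_exp / E_pose encoders: encodes one
    attribute conditioned on the image feature and previously decoded codes."""
    def __init__(self, feat_dim, cond_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim + cond_dim, 256),
                                 nn.ReLU(),
                                 nn.Linear(256, out_dim))

    def forward(self, img_feat, *conds):
        return self.net(torch.cat((img_feat,) + conds, dim=-1))

d = 64                                  # latent attribute dimension (assumption)
feat = torch.randn(2, 512)              # pooled image feature for I (assumption)

E_id   = ConditionalEncoder(512, 0, d)
E_exp  = ConditionalEncoder(512, d, d)        # conditioned on tau_id
E_pose = ConditionalEncoder(512, 2 * d, d)    # conditioned on tau_id and tau_id_exp

tau_id          = E_id(feat)                         # identity-consistent code
tau_id_exp      = E_exp(feat, tau_id)                # Eq. (5), first line
tau_id_exp_pose = E_pose(feat, tau_id, tau_id_exp)   # Eq. (5), second line

# Eq. (6): couple the three codes with the Hadamard product.
T_id_exp_pose = tau_id * tau_id_exp * tau_id_exp_pose
print(T_id_exp_pose.shape)              # torch.Size([2, 64])
```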

3.3.2 Expression branch

The expression branch couples the identity and pose attributes to refine the reconstruction of facial details while preserving the consistency of the expression attribute representation \({\bar{\tau }}_{\text {exp}}={\mathcal {E}}_{\text {exp}}\left( {\varvec{I}}; {\beta }_{\exp }\right) \). Its joint attribute decomposition is expressed as:

$$\begin{aligned} {\bar{\tau }}_{{\text {exp}}, i d}&={\mathcal {E}}_{i d}\left( \alpha _{i d}; {\varvec{I}}, {\bar{\tau }}_{\text {exp}}\right) \\ {\bar{\tau }}_{{\text {exp}}, i d, {\text {pose}}}&={\mathcal {E}}_{{\text {pose}}}\left( v_{{\text {pose}}}; {\varvec{I}}, {\bar{\tau }}_{\text {exp}}, {\bar{\tau }}_{{\text {exp}}, i d}\right) \\ {\bar{\tau }}_{{\text {exp}}, {\text {pose}}}&={\mathcal {E}}_{{\text {pose}}}\left( v_{{\text {pose}}}; {\varvec{I}}, {\bar{\tau }}_{\text {exp}}\right) \\ {\bar{\tau }}_{{\text {exp}}, {\text {pose}}, i d}&={\mathcal {E}}_{i d}\left( \alpha _{i d}; {\varvec{I}}, {\bar{\tau }}_{\text {exp}}, {\bar{\tau }}_{{\text {exp}}, {\text {pose}}}\right) \end{aligned}$$
(7)

where \({\bar{\tau }}_{ \text {exp }}\) is the expression attribute representation obtained from the input image \({\varvec{I}}\) by the \({\mathcal {E}}_{\text {exp}}\) encoder. \({\bar{\tau }}_{{\text {exp}}, i d}\) is the identity attribute representation obtained by the \({\mathcal {E}}_{i d}\) encoder based on \(\bar{{\tau }}_{\text {exp}}\), and \({\bar{\tau }}_{{{\text {exp}},id,{\text {pose}} }}\) is the pose attribute representation obtained by the \({\mathcal {E}}_{{\text {pose}}}\) encoder based on \({\bar{\tau }}_{{\text {exp}}, i d}\). Similarly, \({\bar{\tau }}_{\text {exp,\, pose }}\) is the pose attribute representation obtained by the \({\mathcal {E}}_{{\text {pose}}}\) encoder based on the expression representation \({\bar{\tau }}_{\text {exp}}\), and \({\bar{\tau }}_{{\text {exp}}, {\text {pose}},i d}\) is the identity attribute representation obtained by the \({\mathcal {E}}_{i d}\) encoder based on \({\bar{\tau }}_{\text {exp,\,pose}}\). Accordingly, the fused identity and pose attribute representations under the condition of constant expression attributes are:

$$\begin{aligned} \begin{aligned} {\bar{T}}_{\left( \beta _{\text {exp }}, \alpha _{i d}, v_{ {{\text {pose}} }}\right) }&={\bar{\tau }}_{\text {exp }} \otimes {\bar{\tau }}_{ {{\text {exp}},id }} \otimes {\bar{\tau }}_{ {{\text {exp}},id,{\text {pose}} }} \\ {\bar{T}}_{\left( \beta _{{\text {exp}}}, v_{{\text {pose}}}, \alpha _{i d}\right) }&={\bar{\tau }}_{\text {exp}} \otimes {\bar{\tau }}_{ {{\text {exp}},{\text {pose}} }} \otimes {\bar{\tau }}_{{\text {exp}},{\text {pose}},id } \end{aligned} \end{aligned}$$
(8)

where \({\bar{T}}_{\left( \beta _{ \text {exp }},\alpha _{i d}, v_{ {{\text {pose}} }}\right) }\) and \({\bar{T}}_{\left( \beta _{ \text {exp }}, v_{ {{\text {pose}} }},\alpha _{i d}\right) }\) represent the fused identity and pose attribute representations obtained after the facial attributes pass through the encoder networks in different orders under expression consistency.

3.3.3 Pose branch

The pose branch aims to improve the alignment of the 3D face landmarks by coupling the identity and expression attributes while preserving the consistency of the pose attribute representation \({\bar{\tau }}_{ {{\text {pose}} }}={\mathcal {E}}_{{\text {pose}}}\left( {\varvec{I}}; v_{{{\text {pose}} }}\right) \). Its joint attribute decomposition is expressed as:

$$\begin{aligned} {\bar{\tau }}_{{\text {pose}}, i d}&={\mathcal {E}}_{i d}\left( \alpha _{i d}; {\varvec{I}}, {\bar{\tau }}_{{\text {pose}}}\right) \\ {\bar{\tau }}_{{\text {pose}}, i d, {\text {exp}}}&={\mathcal {E}}_{\text {exp}}\left( \beta _{\text {exp}}; {\varvec{I}}, {\bar{\tau }}_{{\text {pose}}}, {\bar{\tau }}_{{\text {pose}}, i d}\right) \\ {\bar{\tau }}_{{\text {pose}}, {\text {exp}}}&={\mathcal {E}}_{\text {exp}}\left( \beta _{\text {exp}}; {\varvec{I}}, {\bar{\tau }}_{{\text {pose}}}\right) \\ {\bar{\tau }}_{{\text {pose}}, {\text {exp}}, i d}&={\mathcal {E}}_{i d}\left( \alpha _{i d}; {\varvec{I}}, {\bar{\tau }}_{{\text {pose}}}, {\bar{\tau }}_{{\text {pose}}, {\text {exp}}}\right) \end{aligned}$$
(9)

where \({\bar{\tau }}_{{{\text {pose}} }}\) is the pose attribute representation obtained from the input image \({{\varvec{I}}}\) by the \({\mathcal {E}}_{{\text {pose}}}\) encoder. \({\bar{\tau }}_{{\text {pose}},id}\) and \({\bar{\tau }}_{\text {pose,\,exp}}\) are the identity and expression attribute representations obtained by the \({\mathcal {E}}_{id}\) and \({\mathcal {E}}_{\text {exp}}\) encoders based on \({\bar{\tau }}_{ {{\text {pose}}}}\). \({\bar{\tau }}_{ {{\text {pose}},id,{\text {exp}} }}\) is the expression attribute representation obtained by the \({\mathcal {E}}_{\text {exp}}\) encoder based on \({\bar{\tau }}_{{{\text {pose}},id }}\). By the same token, \({\bar{\tau }}_{{{\text {pose}},{\text {exp}},id }}\) is the identity attribute representation obtained by the \({\mathcal {E}}_{id}\) encoder based on \({\bar{\tau }}_{\text {pose,\,exp }}\). Therefore, the fused identity and expression attribute representations under the condition of a constant pose are:

$$\begin{aligned} \begin{aligned} \bar{{T}}_{\left( v_{ {{\text {pose}} }}, {\alpha }_{i d}, {\beta }_{\text {exp }}\right) }&={\bar{\tau }}_{{{\text {pose}} }} \otimes {\bar{\tau }}_{{{\text {pose}},id }} \otimes {\bar{\tau }}_{{\text {pose}},id,{\text {exp}}}\\ {\bar{T}}_{\left( v_{{{\text {pose}} }}, \beta _{\text {exp }}, \alpha _{i d}\right) }&={\bar{\tau }}_{{{\text {pose}} }} \otimes {\bar{\tau }}_{\text {pose,\,exp}} \otimes {\bar{\tau }}_{{{\text {pose}},{\text {exp}},id }} \end{aligned} \end{aligned}$$
(10)

where \({\bar{T}}_{\left( v_{{{\text {pose}}}},\alpha _{i d},\beta _{ \text {exp }}\right) }\) and \({\bar{T}}_{\left( v_{ {{\text {pose}} }},\beta _{ \text {exp }},\alpha _{i d}\right) }\) represent the fused identity and expression attribute representations obtained after the facial attributes pass through the encoder networks in different orders under pose consistency.

3.3.4 Fusion module

In order to reconstruct a more realistic and complete 3D human face, we extract the face information related to each attribute from the identity, expression, and pose branch networks and use the fusion module to merge the face attributes to complete an accurate 3D face reconstruction. Because the different attribute representations provide complementary information, feeding the representations obtained from these three branch networks into the fusion network allows the network to leverage the complementary information between different attributes. This improves the modeling and reconstruction of facial features and helps reduce information loss and reconstruction errors during learning, leading to higher overall reconstruction quality and accuracy. The face attribute representation of the fusion module is expressed as:

$$\begin{aligned} G=F\left( \begin{array}{c} {\text {Cat}}\left[ {{\bar{T}}}_{\left( \alpha _{i d}, \beta _{\text {exp}}, v_{{\text {pose}}}\right) }, {{\bar{T}}}_{\left( \alpha _{i d}, v_{{\text {pose}}}, \beta _{\text {exp}}\right) }\right] , \\ {\text {Cat}}\left[ {{\bar{T}}}_{\left( \beta _{\text {exp}}, \alpha _{i d}, v_{{\text {pose}}}\right) }, {{\bar{T}}}_{\left( \beta _{\text {exp}}, v_{{\text {pose}}}, \alpha _{i d}\right) }\right] ,\\ {\text {Cat}}\left[ {{\bar{T}}}_{\left( v_{{\text {pose}}}, \alpha _{i d}, \beta _{\text {exp}}\right) }, {{\bar{T}}}_{\left( v_{{\text {pose}}}, \beta _{\text {exp}}, \alpha _{i d}\right) }\right] \end{array}\right) \end{aligned}$$
(11)

where \({\text {Cat}}(\cdot )\) represents the concatenation of the facial attribute representations, \(F(\cdot )\) fuses the concatenated representations, and G is the feature output after fusing the facial attributes.
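
A minimal sketch of the fusion step of Eq. (11) is shown below, assuming the fused feature G is mapped to the 62 model parameters of Sect. 3.1; the layer sizes and this final mapping are assumptions rather than the reported architecture.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Sketch of Eq. (11): concatenate the coupled representations of the three
    branches and fuse them into a single feature G. Mapping G to the 62 model
    parameters and the layer sizes are assumptions."""
    def __init__(self, d=64, out_dim=62):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(6 * d, 256),
                                  nn.ReLU(),
                                  nn.Linear(256, out_dim))

    def forward(self, id_pair, exp_pair, pose_pair):
        # Each pair holds the two orderings from one branch, e.g.
        # (T_(id,exp,pose), T_(id,pose,exp)) for the identity branch.
        return self.fuse(torch.cat([*id_pair, *exp_pair, *pose_pair], dim=-1))

d = 64
fusion = FusionModule(d)
pair = lambda: (torch.randn(2, d), torch.randn(2, d))
G = fusion(pair(), pair(), pair())
print(G.shape)                          # torch.Size([2, 62])
```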

3.4 Objective loss function

In the face attribute branching network, we introduce four learning objectives during model training. The learned face attribute parameters mainly involve identity, expression, and pose. To improve the network's learning of identity attributes, we optimize the parameters \({\alpha _{i d}}\) by minimizing the vertex distance between the predicted face geometry and the ground-truth 3D face:

$$\begin{aligned} {\mathcal {L}}_{i d}&=\left\| S_{\text {model }}-{\bar{S}}_{\text {model }}\right\| ^2 \\&=\left\| \Delta S_{i d}-\Delta {\bar{S}}_{i d}\right\| ^2 \\&=\left\| A_{i d} \alpha _{i d}-A_{i d} {\bar{\alpha }}_{i d}\right\| ^2 \end{aligned}$$
(12)

where \({\alpha }_{i d}\) denotes the predicted face identity parameters and \({\bar{\alpha }}_{i d}\) the ground-truth parameters. \({A}_{i d}\) is the identity basis of the 3DMM PCA model. In addition, the different dimensions of \({\alpha }_{id}\) and their singular values affect the face geometry differently. Therefore, in the identity branch, the \({\mathcal {L}}_{i d}\) constraint on the identity information can reduce the influence of the dominant dimensions, especially those with large singular values. Similarly, to refine the expression details of the face, we use an expression consistency loss to enhance expression preservation:

$$\begin{aligned} {\mathcal {L}}_{\text {exp}}&=\left\| \Delta E_{\text {exp}}-\Delta {\bar{E}}_{\text {exp}}\right\| ^2 \\&=\left\| B_{\text {exp}} {\beta }_{\text {exp}}-B_{\text {exp}} {\bar{\beta }}_{\text {exp}}\right\| ^2 \end{aligned}$$
(13)

where \({\beta }_{\text {exp}}\) denotes the predicted face expression parameter and \({{\bar{\beta }}_{\text {{exp}}}}\) is the expression ground truth parameter. \({B}_{\text {exp}}\) is the expression base of the 3D morphable model PCA. Since the pose of a face has limited degrees of freedom, to simplify the computation, we constrain the pose estimation using the loss of the facial landmarks:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{p}&=\Vert \big (f * P_{r} * R * {\bar{S}}_{ \text {model }}+t\big )- \\&\quad \big ({\bar{f}} * P_{r} * {\bar{R}} * {\bar{S}}_{\text {model }}+{\bar{t}}\big ) \Vert ^{2} \end{aligned} \end{aligned}$$
(14)

where \({v}_{{\text {pose}}}=[f, R, t]\) is the predicted pose parameter and \({\bar{v}}_{{{\text {pose}} }}=[{\bar{f}}, {\bar{R}}, {\bar{t}}]\) is the ground-truth pose parameter. Thus, we further improve the face landmark alignment accuracy under large pose conditions by constraining the pose parameters. Although the \({\mathcal {L}}_{i d}\), \({\mathcal {L}}_{\text {exp}}\), and \({\mathcal {L}}_{p}\) loss functions strongly constrain the identity, expression, and pose attributes of the three branches, the reconstructed 3D faces still lack constraints on the geometric contours. Therefore, we use a constraint on sparse 2D face landmarks to improve the reconstructed geometric contour information:

Table 1 \(\mathrm {NME_{2d}}(\%)\) of face alignment results on AFLW and AFLW2000-3D
$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{68}&=\Vert \big (f * P_{r} * R * {S}_{ {68}}+t\big )- \\&\quad \big ({\bar{f}} * P_{r} * {\bar{R}} * {\bar{S}}_{{68 }}+{\bar{t}}\big ) \Vert ^{2} \end{aligned} \end{aligned}$$
(15)

The final hybrid loss function \({{\mathcal {L}}}\) is defined as:

$$\begin{aligned} {\mathcal {L}}=\lambda _{i d} {\mathcal {L}}_{i d}+\lambda _{\exp } {\mathcal {L}}_{\exp }+\lambda _p {\mathcal {L}}_p+\lambda _{68} {\mathcal {L}}_{68} \end{aligned}$$
(16)

where \(\lambda _{i d}\), \(\lambda _{\text {exp}}\), \(\lambda _p\), and \(\lambda _{68}\) are the weights that balance these constraints.
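
The hybrid objective of Eqs. (12)–(16) can be sketched as follows; the tensor shapes, dictionary keys, and loss weights are illustrative assumptions rather than the exact training configuration.

```python
import torch

def hybrid_loss(pred, gt, bases, weights):
    """Sketch of Eqs. (12)-(16). `pred`/`gt` are dicts with keys 'alpha_id',
    'beta_exp', 'pose' = (f, R, t), and 'S68' (3 x 68 landmark vertices);
    `bases` holds A_id and B_exp. Keys and weights are assumptions."""
    A_id, B_exp = bases['A_id'], bases['B_exp']

    loss_id = ((A_id @ pred['alpha_id'] - A_id @ gt['alpha_id']) ** 2).sum()
    loss_exp = ((B_exp @ pred['beta_exp'] - B_exp @ gt['beta_exp']) ** 2).sum()

    def project(f, R, t, S):                       # weak perspective, Eq. (3)
        P_r = torch.tensor([[1., 0., 0.], [0., 1., 0.]])
        return f * P_r @ (R @ S) + t

    f, R, t = pred['pose']
    fg, Rg, tg = gt['pose']
    S_gt = gt['S_model']                           # ground-truth dense shape (3 x n)
    loss_p = ((project(f, R, t, S_gt) - project(fg, Rg, tg, S_gt)) ** 2).sum()
    loss_68 = ((project(f, R, t, pred['S68'])
                - project(fg, Rg, tg, gt['S68'])) ** 2).sum()

    w = weights                                    # e.g. dict(id=1., exp=1., p=1., l68=1.)
    return w['id'] * loss_id + w['exp'] * loss_exp + w['p'] * loss_p + w['l68'] * loss_68

# Minimal usage with random placeholders (shapes only for illustration):
n = 100
bases = {'A_id': torch.randn(3 * n, 40), 'B_exp': torch.randn(3 * n, 10)}
pose = (torch.tensor(1e-3), torch.eye(3), torch.zeros(2, 1))
pred = {'alpha_id': torch.randn(40), 'beta_exp': torch.randn(10),
        'pose': pose, 'S68': torch.randn(3, 68)}
gt = dict(pred, S_model=torch.randn(3, n), S68=torch.randn(3, 68))
print(hybrid_loss(pred, gt, bases, dict(id=1., exp=1., p=1., l68=1.)))
```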

4 Experiments

This section conducts several experiments and ablation studies to demonstrate DRTN’s effectiveness under different settings in 3D face reconstruction and dense face alignment on three extensively evaluated datasets, 300W-LP [56], AFLW2000-3D [56], and AFLW [31]. In addition, we further evaluate the generalization performance of DRTN on the LS3D-W [7] and CelebA [37] datasets. The process of 3D face reconstruction and the details of experimental implementation can be found in our supplementary material.

4.1 Datasets

The proposed DRTN is trained on 300W-LP [56] and evaluated on four popular datasets: AFLW [31], AFLW2000-3D [56], LS3D-W [7], and CelebA [37].

4.1.1 300W-LP

The 300W-LP dataset [56] is extended from 300W by standardizing multiple alignment datasets with 68 landmarks, including AFW [57], LFPW [3], HELEN [55], IBUG [43], and XM2VTS [18]. By employing facial analysis techniques [56], 61,225 samples were generated under challenging pose conditions, including 1786 samples from IBUG, 5207 from AFW, 16,556 from LFPW, and 37,676 from HELEN. Note that we did not use the XM2VTS dataset. Additionally, we augmented the dataset by horizontally flipping the 61,225 samples, resulting in 122,450 natural face samples and 565,404 synthetic face samples. During network training, we employed 636,252 samples and reserved 51,602 samples for testing, ensuring no overlap in identities.

4.1.2 AFLW2000-3D

AFLW2000-3D [56] evaluates 3D face alignment on challenging unconstrained images. The first 2000 images are from the AFLW dataset, and its annotations are expanded using 3DMM parameters and 68 3D landmarks. This dataset evaluates our method’s performance on face alignment and reconstruction tasks.

4.1.3 AFLW

AFLW [31] is a large-scale face dataset that includes multiple poses and views, which is generally used to evaluate the effectiveness of facial landmark detection. The dataset has 25,993 face images with 21 landmarks annotated for each face, whereas landmarks are not annotated for faces in invisible regions. Additionally, the dataset includes face pose angle annotations obtained from the average 3D face reconstruction. Most face images in the AFLW dataset are color images, and a few are grayscale images, of which \(59\%\) are female and \(41\%\) are male. This dataset is well suited for multi-angle multi-face detection, landmark localization, and head pose estimation and is an essential dataset for face landmark alignment.

Fig. 4

Cumulative errors distribution (CED) curves on AFLW and AFLW2000-3D

Fig. 5

Comparison of 3D facial landmark detection with 3DDFA [56], DAMDNet [26], MARN [34], MFRRN [33], and DRTN(Ours) on AFLW2000-3D. Best viewed on screen by zooming in

Table 2 \(\mathrm {NME_{3 d}}(\%)\) of face reconstruction results on AFLW2000-3D

4.1.4 LS3D-W

LS3D-W [7] is a large-scale face alignment annotation dataset created by the computer vision laboratory at the University of Nottingham. The face images are from AFLW [31], 300-VW [44], 300-W [1], and FDDB [24]. Each face image contains 68 annotated landmarks. Thus, the dataset contains approximately 230,000 accurately labeled face images.

4.1.5 CelebA

CelebA [37] is a large-scale face attribute dataset containing 202,599 face images of 10,177 identities, each image annotated with 40 attributes. The images cover a wide range of pose variations and complex backgrounds.

4.2 Analysis of 3D face alignment results

We quantitatively evaluate the face landmark alignment performance using the normalized mean error \(\mathrm {NME_{2d}}(\%)\) on the AFLW2000-3D [56] and AFLW [31] datasets. We divide the test set into three subsets based on the absolute yaw angle: \(\left[ 0^{\circ }, 30^{\circ }\right] \), \(\left[ 30^{\circ }, 60^{\circ }\right] \), and \(\left[ 60^{\circ }, 90^{\circ }\right] \). The proposed DRTN method is compared experimentally with current state-of-the-art methods [9, 13, 26, 33, 34, 36, 45, 47, 50, 53, 54, 56]. The corresponding results are reported in Table 1, where lower values are better (the best result in each category is highlighted in bold). Figure 4 illustrates the corresponding CED curves. Note that DRTN is only compared with the methods in Table 1 that have open-source code. Compared to the benchmark methods [26, 33, 34, 36, 47, 56], DRTN has a lower normalized mean error on the AFLW [31] and AFLW2000-3D [56] datasets. The experimental results demonstrate that DRTN significantly improves 3D face landmark alignment accuracy across the full pose range and remains robust under large pose conditions.
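
For reference, the \(\mathrm {NME_{2d}}\) metric can be computed as in the sketch below; normalizing by the ground-truth bounding-box size \(\sqrt{w \times h}\) follows common practice on AFLW2000-3D and is our assumption about the exact protocol used here.

```python
import numpy as np

def nme_2d(pred_lms, gt_lms):
    """Normalized mean error (in percent) over 68 2D landmarks. The bounding-box
    normalizer sqrt(w * h) is an assumption about the evaluation protocol."""
    pred_lms, gt_lms = np.asarray(pred_lms), np.asarray(gt_lms)   # (68, 2) each
    w, h = gt_lms.max(axis=0) - gt_lms.min(axis=0)
    per_point = np.linalg.norm(pred_lms - gt_lms, axis=1)
    return 100.0 * per_point.mean() / np.sqrt(w * h)

gt = np.random.rand(68, 2) * 100        # placeholder ground-truth landmarks
pred = gt + np.random.randn(68, 2)      # placeholder predictions
print(f"NME_2d = {nme_2d(pred, gt):.3f}%")
```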

Figure 5 highlights our method's 3D face landmark alignment results on the AFLW2000-3D [56] dataset. The advantage of using a 3DMM instead of other geometry representations is that the semantic facial landmarks can be directly associated with the corresponding points in the reconstructed geometry. The visualization results show that the DRTN method significantly outperforms the 3DDFA [56], DAMDNet [26], GSRN [47], MARN [34], and MFRRN [33] methods for 3D face landmark alignment under large pose, extreme expression, and occlusion conditions, especially around the eyes, mouth, and face contours. The qualitative results suggest that DRTN significantly improves the network's learning of face attribute features in complex environments and thus improves the accuracy of face landmark alignment.

Fig. 6

Cumulative errors distribution (CED) curves on AFLW2000-3D

Fig. 7

CED curve of the small, medium, and large pose on AFLW2000-3D

Fig. 8

Comparison of qualitative results of 3D face reconstruction on AFLW2000-3D. Images in the first column are the source images. The image textures reveal that the 3D face texture details reconstructed by DRTN are more natural and detailed

Fig. 9

Qualitative results of 3D face reconstruction on AFLW2000-3D. Best viewed on screen with zooming

Table 3 Ablation study. \(\mathrm {NME(\%)}\) of face alignment and reconstruction results on AFLW2000-3D for different network branching structures

4.3 Analysis of 3D face reconstruction results

The AFLW dataset is unsuitable for evaluating 3D face reconstruction because recovering 3D faces from the annotated visible landmarks usually leads to ambiguity. To validate the effectiveness of our model in 3D face reconstruction, we compare the 3D normalized mean error \(\mathrm {NME_{3 d}}(\%)\) on the AFLW2000-3D dataset (see Table 2). The best result in each category is highlighted in bold; lower values are better. The experimental results in Table 2 highlight that our DRTN method outperforms the state-of-the-art methods [26, 33, 34, 47, 56] for face reconstruction under both medium and large poses.

Fig. 10

Comparison of different network branching structures on the AFLW2000-3D dataset. \({\textbf {a}}\) and \({\textbf {b}}\) are reconstructed high-fidelity 3D faces and facial geometries. \({\textbf {c}}\) and \({\textbf {d}}\) are error maps and face landmark alignment results

Table 4 Face attribute disentanglement performance assessed using the \(\mathrm {NME_{3 d}}(\%)\) on the AFLW2000-3D dataset

Compared to the recent MARN [34], the \(\mathrm {NME_{3 d}}(\%)\) of our DRTN at yaw angles \(\left[ 0^{\circ }, 30^{\circ }\right] \), \(\left[ 30^{\circ }, 60^{\circ }\right] \), and \(\left[ 60^{\circ }, 90^{\circ }\right] \) is reduced by \(3.2 \%\), \(3.9 \%\), and \(4.9 \%\), respectively. Our method thus significantly improves the prediction accuracy at large yaw angles \(\left[ 60^{\circ }, 90^{\circ }\right] \), indicating that DRTN can reconstruct 3D faces accurately even under unconstrained large pose conditions. Compared with GSRN [47], DRTN reduces the \(\mathrm {NME_{3 d}}(\%)\) by \(1.0 \%\) and \(6.6 \%\) at medium \(\left[ 30^{\circ }, 60^{\circ }\right] \) and large \(\left[ 60^{\circ }, 90^{\circ }\right] \) poses, respectively. This further validates the superior 3D reconstruction capability of DRTN under complex occlusions. Figure 6 depicts the corresponding CED curves for 3D face reconstruction, which intuitively show that DRTN achieves accurate 3D face reconstruction.

In order to prove our method's effectiveness for face reconstruction under large poses, in the following experiment we regard the whole of AFLW2000-3D as the testing set and divide it into three subsets according to the absolute yaw angles: \(\left[ 0^{\circ }, 30^{\circ }\right] \), \(\left[ 30^{\circ }, 60^{\circ }\right] \), and \(\left[ 60^{\circ }, 90^{\circ }\right] \), with 1312, 383, and 305 samples, respectively. In Fig. 7, the CED curves are depicted from left to right for the small pose \(\left[ 0^{\circ }, 30^{\circ }\right] \), the medium pose \(\left[ 30^{\circ }, 60^{\circ }\right] \), and the large pose \(\left[ 60^{\circ }, 90^{\circ }\right] \). Comparing the CED curves reveals that our DRTN method achieves improved performance under small, medium, and large poses, validating its robustness for reconstruction under large poses. Figure 8 visualizes results demonstrating that DRTN reconstructs 3D faces with clear texture and natural expressions under large poses, extreme expressions, and partial occlusion. As depicted in Fig. 9, the DRTN method reconstructs facial geometry more accurately than the current state-of-the-art methods, especially in the local details of the eyes, mouth, and wrinkles. The 3DDFA [56], DAMDNet [26], MARN [34], and MFIRRN [33] methods do not reconstruct faces with fine expression details because they lack face attribute correlation learning. Therefore, the experiments reveal that our DRTN approach has significant advantages in high-fidelity 3D facial reconstruction under large poses, occlusion, and extreme expressions.

4.4 Ablation experiments

To verify the effectiveness of each attribute branch module in DRTN, Table 3 presents the ablation experiments on the AFLW-2000 dataset, where the normalized mean errors of face landmark alignment and reconstruction are \(\mathrm {NME_{2d}}\) and \(\mathrm {NME_{3d}}\), respectively. Comparing the first to fourth rows of Table 3 reveals that the network with the face pose, expression, and identity attribute branches has lower reconstruction and alignment errors than the baseline model without any face attribute branches.

The experimental results indicate that the disentangled representation of face attributes improves the model's attribute learning ability and robustness. In addition, the second to fourth rows of Table 3 reveal that the identity attribute model outperforms the expression and pose attribute models in the single-branch setting because the training labels contain more identity attribute parameters than expression and pose parameters. Therefore, when the labels carry more information for a given attribute, the single-branch network learns the corresponding attribute parameters more effectively.

Similarly, the fifth to seventh rows of Table 3 reveal that the network models using two face attribute branches have lower face alignment and reconstruction errors than the single-branch networks. The results further indicate that the face-attribute disentangled representation improves the network's ability to learn face attribute features. Finally, the error results in the table show that the designed DRTN method has good attribute learning ability in face alignment and reconstruction. The main reason is that the information between face attributes is correlated rather than isolated, and we enrich the details of face reconstruction by exploiting this correlation.

Fig. 11

Experiments on latent space interpolation

Fig. 12

Visual reconstruction comparison of DRTN with the approaches of Deng et al. [18], DECA [20], and FaceScape [51] on the AFLW2000-3D dataset

Additionally, the results demonstrate the role of face attribute branching in face landmark alignment and reconstruction. We visualize the face landmark alignment and reconstruction results for each branch network in Fig. 10. The reconstructed high-fidelity 3D face and geometry results show that the 3D face reconstructed by DRTN has a natural expression and fits the shape of the original image, with clear face edges and facial feature contours. Moreover, we visualize the reconstructed 3D faces of each model as heat maps, revealing that the 3D faces reconstructed by our method are more realistic and have fewer errors. Regarding 3D face landmark alignment, DRTN attains higher landmark alignment accuracy than the other face attribute branching models under large poses. This is because the other branching networks lack mutual learning between attributes when learning the face attribute parameters. Although such a network learns independent face attributes well, its reconstruction and alignment accuracy in complex and diverse situations remains limited because the 3D face is not a linear model. In contrast, the DRTN model incorporates identity, expression, and pose information and therefore shows good stability when reconstructing unconstrained scenes.

4.5 Attribute disentangled representation experiment

Next, we present more comprehensive evidence of the effective attribute disentanglement achieved by controlling attribute-specific operations and observing the model's responses to changes in identity, expression, and pose while keeping the other attributes constant. We use the normalized mean error in Table 4 to measure the disentanglement performance, separately reconstructing each attribute during the disentanglement process. For example, in the identity attribute branch of Table 4, we input face images into the model to obtain the intermediate identity attribute representation while setting the expression and pose attribute information to zero. This strategy results in a reconstructed normalized mean error of 6.222 and ensures that the model focuses solely on the disentanglement of the identity attribute.

In the single-branch attribute network, we validate the effectiveness of combining identity, expression, and pose attributes by controlling individual attributes and assessing their impact on the others. We use the reconstructed normalized mean error to evaluate the attribute disentanglement capability. The results in Table 4 show that our attribute branch network effectively enhances the precision of 3D face reconstruction and improves the network's learning of facial identity, expression, and pose attributes. Additionally, we leverage the learned latent space to reconstruct 3D face models by gradually interpolating between identity and expression attributes. Due to singularity issues, we indirectly reflect the results of the pose attribute by projecting from 3D to 2D, as shown in Table 4. In Fig. 11, our disentangled representation comprises two latent codes: one for identity and one for expression. Keeping the identity attribute constant along rows and interpolating the expression latent code, we use the trained decoder to reconstruct 3D face models with varying expressions. Similarly, along columns, we keep the expression attribute constant and interpolate the identity latent code to reconstruct 3D face models with different identities. In this experiment, we use a step size of 0.30 for the latent code interpolation of identity and expression, and the interpolation results appear meaningful and reasonable. Hence, our DRTN method effectively disentangles facial attributes.
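
A minimal sketch of the latent-space interpolation used for Fig. 11 is given below; the latent dimensions and the decoder interface are assumptions, not details reported in the paper.

```python
import torch

def interpolate_codes(code_a, code_b, step=0.30):
    """Linearly interpolate between two latent codes with the step size 0.30
    used in the Fig. 11 experiment."""
    alphas = torch.arange(0.0, 1.0 + 1e-6, step)
    return [(1 - a) * code_a + a * code_b for a in alphas]

# Rows of Fig. 11 keep identity fixed and interpolate the expression code;
# columns keep expression fixed and interpolate the identity code.
id_code = torch.randn(64)                       # latent dimension is an assumption
exp_a, exp_b = torch.randn(64), torch.randn(64)
for exp_code in interpolate_codes(exp_a, exp_b):
    latent = torch.cat([id_code, exp_code])     # fed to the trained decoder (not shown)
print(latent.shape)                             # torch.Size([128])
```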

4.6 Qualitative comparison of advanced methods

To demonstrate our method's performance, we conducted additional experiments comparing DRTN with the latest approaches. Due to the differences between our method and these state-of-the-art approaches, we cannot directly conduct quantitative comparisons with them. To compensate for this limitation, we conducted additional qualitative experiments and analyses to demonstrate the advantages and potential of our method for 3D face reconstruction under large pose and occlusion conditions. As depicted in Fig. 12, DRTN is compared with the advanced methods of Deng et al. [18], DECA [20], and FaceScape [51] based on visual qualitative results. The method of Deng et al. [18] has limitations in 3D face reconstruction under large yaw angles, as evidenced by its inability to reconstruct the faces in the first and second input images; this is primarily due to its inability to detect faces under large poses. While the DECA [20] and FaceScape [51] methods can reconstruct complete 3D faces under large pose conditions, their reconstructed 3D facial textures exhibit artifacts under low lighting and large pose conditions. In contrast, DRTN effectively reconstructs textured 3D faces, and the natural contour of the facial expressions fits the true 3D facial shape better.

5 Conclusion

This paper proposes a disentangled representation transformer network that can recover detailed 3D faces in an unconstrained environment and performs dense face alignment more accurately. The developed DRTN method enhances the network's learning of latent information about face attributes. It addresses the effects of facial expression, head pose, and partial occlusion on reconstruction and landmark alignment. Quantitative results demonstrate that the proposed DRTN model is more accurate than state-of-the-art dense face alignment and 3D reconstruction methods. Additionally, extensive qualitative experiments reveal that DRTN can successfully reconstruct high-fidelity 3D faces from 2D face images with rich details and strong generalization ability. Future work will further investigate the proposed method in 3D face reconstruction, such as face reconstruction from videos and cartoon character reconstruction, to evaluate DRTN's generalization ability [21].