Joint 3D facial shape reconstruction and texture completion from a single image

Recent years have witnessed significant progress in image-based 3D face reconstruction using deep convolutional neural networks. However, current reconstruction methods often perform poorly in self-occluded regions and can produce inaccurate correspondences between a 2D input image and a 3D face template, hindering their use in real applications. To address these problems, we propose a deep shape reconstruction and texture completion network, SRTC-Net, which jointly reconstructs 3D facial geometry and completes texture with correspondences from a single input face image. In SRTC-Net, we leverage the geometric cues from the completed 3D texture to reconstruct detailed structures of 3D shapes. The SRTC-Net pipeline has three stages. The first introduces a correspondence network to identify pixel-wise correspondence between the input 2D image and a 3D template model, and transfers the input 2D image to a U-V texture map. Then we complete the invisible and occluded areas in the U-V texture map using an inpainting network. To obtain the 3D facial geometry, we predict a coarse shape (U-V position map) from the face segmented by the correspondence network using a shape network, and then refine the 3D coarse shape by regressing a U-V displacement map from the completed U-V texture map in a pixel-wise manner. We evaluate our methods on 3D reconstruction tasks as well as face frontalization and pose-invariant face recognition tasks, using both in-the-lab datasets (MICC, Multi-PIE) and in-the-wild datasets (CFP). The qualitative and quantitative results demonstrate the effectiveness of our methods at inferring 3D facial geometry and complete texture; they outperform or are comparable to the state-of-the-art.


Introduction
This paper attacks the problem of recovering 3D shape and complete facial texture from a single 2D face image. Image-based 3D face reconstruction is a fundamental yet challenging problem in computer vision, with broad applications to facial animation [1,2], pose-invariant face recognition [3][4][5][6][7], human-machine interaction [8], etc. Dramatic improvements have been made in image-based 3D face reconstruction in recent years [9][10][11].
Current image-based 3D face reconstruction methods can roughly be divided into two categories: model-based methods and pixel-wise regression methods. The first category utilizes 3D template models of the face for reconstruction. The 3D morphable model (3DMM) [12][13][14] is a popular choice within the first category due to its simplicity and efficiency. The main defect of 3DMM, however, is its low-dimensional representation, which limits its representational power and often results in over-smooth 3D faces. Bas et al. [15] and Feng et al. [10] use the BFM model [16] as a prior shape to densely predict the 3D face as a U-V position map. However, their methods cannot effectively reconstruct local structures of the input face. Pixel-wise regression methods [17][18][19] directly predict the depth or 3D locations of each pixel in the input, and can be effectively implemented with modern deep networks, e.g., U-Net [20]. Their pixel-wise nature enables these methods to capture fine facial details, but they lack a dense correspondence between the 2D input image and a 3D facial template, which may be limiting in practical applications like expression transfer and animation. Moreover, these methods cannot handle self-occluded regions well.
Compared to 3D geometry, recovering complete facial texture has received less attention. It is an ill-posed problem, since certain face regions can be self-occluded in the input image. Previous texture completion methods roughly fall into two categories: appearance-based methods [14,21-25] and geometric-modeling methods [3,26-28]. Appearance-based methods often leverage either linear appearance models (e.g., diffuse albedo) to recover low-frequency facial appearance, or generative adversarial networks (GANs) [29] to synthesize frontalized face images. However, appearance-based methods may produce unrealistic visual results and are often unable to preserve facial identity. On the other hand, geometric-modeling-based work [3,26-28] determines dense correspondences between a 2D face image and a predefined 3D face template (model), which allows mapping of the 2D face image into U-V space and acquisition of a U-V texture map. One can then inpaint or complete the self-occluded textures in the U-V map. However, 3D geometric models like 3DMM use only a limited number of coefficients or landmarks to establish dense correspondence with an input image containing tens of thousands of pixels or more. This imbalance inevitably causes certain non-facial areas (e.g., background, hair) in the input to also be mapped into the facial texture map. Downstream inpainting algorithms cannot remove these defects, leading to poor facial texture completion.
To address the above problems, we propose a novel shape reconstruction and texture completion network, SRTC-Net, which jointly reconstructs 3D facial geometry and completes texture from a single input image. For texture reconstruction, we first learn the 2D-to-3D correspondence via two cascading tasks: facial segmentation and PNCC regression. With this correspondence, we then unwrap the input image into U-V space to get a U-V texture map. Subsequently, we complete the U-V texture with an inpainting network. For shape reconstruction, we obtain coarse shapes with a model-based method. In addition, we extract geometric cues from the completed U-V texture with a pixel-wise method, and integrate them into the outputs of the model-based method.
As shown in Fig. 1, our method achieves more accurate 2D-to-3D correspondence than previous methods, especially at edges. Our contributions can be summarized as follows:
1. A deep architecture, SRTC-Net, which reconstructs 3D shape and completes 3D texture jointly. SRTC-Net decomposes this ill-posed task into three more tractable problems.
2. A correspondence network (C-Net) to densely align the 2D image and 3D face model via a facial segmentation subnetwork and a PNCC prediction subnetwork. This lets us establish more accurate 2D-to-3D correspondences and improves the quality of 3D texture completion. The strong geometric cues in the completed 3D texture can also be exploited to refine 3D face shapes.
3. A shape network (S-Net) for 3D geometry reconstruction, which lets us reconstruct detailed structure in a pixel-wise fashion while preserving topology. Specifically, the S-Net reconstructs a coarse shape with a model-based coarse-shape subnetwork, and then predicts detailed structures with a pixel-wise regression architecture in a fine-shape subnetwork. We build the correspondence between coarse and fine shapes by sharing the same U-V coordinates.
4. Extensive qualitative and quantitative experiments on both controlled and in-the-wild databases which demonstrate the effectiveness of our methods: they outperform or are comparable to state-of-the-art approaches to 3D shape reconstruction and texture completion.

Fig. 1 Examples of correspondences between 2D faces and 3D models provided by PRNet [10], Deep3DFace [11], and our method. PRNet and Deep3DFace produce artifacts caused by inaccurate 2D-to-3D correspondences (green box), while ours does not.

Correspondence between 2D face images and 3D models
Determining correspondences between a 2D face image and a 3D template model is a crucial step in the U-V texture mapping pipeline. Akshay et al. [30] map a 2D face image to a 3D model by finding a rigid transformation that matches the vertices of the template 3D face model to 2D feature points detected by VAAM [31]. This method, however, requires near-frontal views of constrained faces, so it is not suitable for our purposes. Deng et al. [3] and Xue et al. [28] apply a landmark-driven 3DMM fitting method to establish the correspondences. Chang et al. [27] employ an off-the-shelf 3D face reconstruction pipeline, PRNet [10], to align a 2D face to a 3D face model. All these 3D face shape estimation methods can introduce facial misalignments, causing non-facial areas to invade the U-V texture map. Incorporating more prior knowledge, Hassner et al. [26] propose using a supplemental 3D reference model to produce front-facing views for all input images, in which landmarks are used to align the input image with the template reference image. Their method can handle 2D faces with neutral expressions; however, it distorts the final texture if the input face has an exaggerated expression. This deficiency is one of the reasons we apply the C-Net in our pipeline: it establishes accurate 2D-3D correspondences that enable our framework to reconstruct high-quality 3D face textures.

Image completion
Inspired by the empirical observation that there is a relationship between local and global texture, Wei et al. [32] apply a Markov random field model for image completion. Efros and Leung [33] apply nonparametric exemplar-based techniques to synthesize textures by assembling patches from the example image. Such techniques only work for stochastic textures, not for images with fine details. With the advent of deep learning, convolutional neural networks (CNNs) were also deployed for this task. Unlike the general image completion task, face completion is susceptible to significant appearance variations. To ensure realism and consistency between local and global content, Li et al. [34] train a face completion network using local and global discriminators as well as a fixed parsing network. UV-GAN [3] uses the same loss function, but replaces the parsing network with a fixed identity classification network. Chang et al. [27] propose a U-V texture completion GAN (TC-GAN) to complete the U-V texture map. In our work, we go further, predicting both a U-V texture map and a U-V position map.

Pose-invariant face recognition
With deeper convolutional neural networks and advanced loss functions [35], many approaches have achieved significant improvements on unconstrained face recognition datasets. However, pose variation is still very challenging even for deep learning-based face recognition methods, due to the missing facial texture caused by self-occlusion; recognition accuracy plummets when it comes to profile-frontal recognition [36]. GAN-based face frontalization [3,14,24,37,38] is a popular approach that decreases the pose discrepancy and improves frontal-profile face verification. Our work can infer both facial texture and geometry, offering an out-of-the-box face frontalization component conducive to pose-invariant face recognition.

3D face reconstruction
The popular 3D morphable model (3DMM) [12] reconstructs a 3D model from a single image in a low-dimensional linear space. Tran et al. [13] regress the shape, texture, and parameter coefficients with a deep neural network. Guo et al. [39] use meta-joint optimization and a short-video-synthesis strategy to learn a lightweight network that reconstructs 3D faces quickly, accurately, and robustly. Sanyal et al. [40] use shape consistency across multiple images of the same person, and inconsistency across images of different people, to train a 3D face reconstruction network without supervision. At the same time, non-linear 3D morphable models [41,42] were introduced to capture more details. Bas et al. [15] encode the 3D coordinates of vertices into a position map and process the 2D position map with a convolutional neural network. Wu et al. [43] employ a symmetry constraint in a photo-geometric autoencoder for unsupervised learning that disentangles shape and appearance. Lin et al. [44] propose a self-supervised graph network to reconstruct realistic facial texture from a single image. Lee and Lee [45] embed uncertainty learning in a deep network to reconstruct high-fidelity 3D faces. Lattas et al. [46] jointly capture facial albedo, 3D shape, and a normal map from an input image using a cascaded pipeline. Our work builds visually vivid 3D faces free from non-facial areas, using a 2D image and a U-V texture map in the last stage.

Method
In this section, we first overview our shape reconstruction and texture completion network, SRTC-Net. Then we introduce our network architecture components, training details, and how we created our training data.

Overview
As Fig. 2 shows, SRTC-Net consists of three modules: the correspondence network (C-Net), the inpainting network (I-Net), and the shape network (S-Net). A face image first goes into the C-Net for face segmentation prediction in the segmentation subnetwork, and for projected normalized coordinate code (PNCC) [47] prediction in the PNCC subnetwork. Using the PNCC prediction, we map the segmented face into U-V space to generate a U-V texture map. The U-V texture map is then fed into the I-Net for texture completion. Finally, we feed the segmented face into the coarse-shape subnetwork to get a coarse 3D face, and acquire the detailed 3D face structure by predicting a U-V displacement map from the completed U-V texture map using the fine-shape subnetwork. Each subnetwork in SRTC-Net is trained separately. We detail each component of SRTC-Net, and the training data collection, in the following sections.

C-Net overview
The C-Net consists of two pixel-wise prediction subnetworks, the segmentation subnetwork and the PNCC subnetwork. Pixel-wise prediction requires an appropriate network architecture. The proposed segmentation subnetwork and PNCC subnetwork are inspired by the image-to-image translation framework [20], where skip connections are used between corresponding layers in the encoder and the decoder. We also make additional modifications following Ref. [17]. The segmentation subnetwork takes in face images and outputs face masks, while the PNCC subnetwork takes segmented faces as input and outputs PNCC maps.
The output PNCC maps are used to map face images into our predefined U-V map.
Each foreground pixel $I(x, y)$ of the input image has one predicted template NCC coordinate $(N_x, N_y, N_z)$, stored in the three channels of point $P(x, y)$ in the predicted PNCC map. Given $(N_x, N_y, N_z)$, we can find the corresponding vertex $V_i$ in the template NCC by nearest point search; $V_i$ has a unique corresponding pixel $D(u, v)$ in the predefined U-V map. Finally, the pixel value of $I(x, y)$ is assigned to $D(u, v)$. We repeat the above process to obtain the U-V texture map.
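This per-pixel lookup can be vectorized with a k-d tree over the template's NCC coordinates. Below is a minimal sketch, assuming precomputed arrays `ncc_vertices` (the NCC coordinate of each template vertex) and `uv_coords` (each vertex's pixel location in the predefined U-V map); all names are illustrative, not from the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def unwrap_to_uv(image, mask, pncc, ncc_vertices, uv_coords, uv_size=84):
    """Map foreground pixels of `image` into U-V space via the predicted PNCC.

    image:        (H, W, 3) input face image
    mask:         (H, W) boolean foreground mask from the segmentation subnetwork
    pncc:         (H, W, 3) predicted PNCC map (per-pixel NCC coordinates)
    ncc_vertices: (N, 3) NCC coordinates of the template vertices
    uv_coords:    (N, 2) integer U-V pixel location (u, v) of each vertex
    """
    tree = cKDTree(ncc_vertices)                   # nearest point search structure
    uv_texture = np.zeros((uv_size, uv_size, 3), dtype=image.dtype)

    ys, xs = np.nonzero(mask)                      # foreground pixels only
    # For each foreground pixel, find the nearest template vertex in NCC space.
    _, idx = tree.query(pncc[ys, xs])              # (num_fg,) vertex indices
    u, v = uv_coords[idx, 0], uv_coords[idx, 1]
    uv_texture[v, u] = image[ys, xs]               # copy pixel colors into U-V map
    return uv_texture
```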

Segmentation subnetwork training
Face images are detected and aligned with the face detector [48] at a resolution of 256×256. The ground-truth masks are binary maps with the same resolution. We employ the binary cross-entropy loss to train this network:

$$L_{\mathrm{seg}} = -\frac{1}{n}\sum_{i=1}^{n}\left[\mathit{Seg}_{g,i}\log \mathit{Seg}_{p,i} + \left(1-\mathit{Seg}_{g,i}\right)\log\left(1-\mathit{Seg}_{p,i}\right)\right] \tag{1}$$

where $\mathit{Seg}_{p,i}$ is the output for the $i$th pixel of the face image, $\mathit{Seg}_{g,i}$ is the corresponding ground truth, and $n$ is the number of pixels in the face image, i.e., 256 × 256.

PNCC subnetwork training
To reduce the computational cost, we first crop the 690k PNCC-image pairs (see Section 3.6.4) using bounding squares of their images' foreground contours, and then resize them to 128×128, giving 690k cropped PNCC-image pairs $P_{pc}$. We use the pixel-wise L1 loss between the ground-truth and predicted PNCC maps to train the PNCC subnetwork:

$$L_{\mathrm{pncc}} = \left\| P_p - P_g \right\|_1 \tag{2}$$

where $P_p$ is the predicted PNCC map and $P_g$ is the ground truth.

I-Net overview
We warp the 690k face images described in Section 3.6.4 into U-V space using their counterpart PNCC maps, obtaining 690k U-V texture maps. As these U-V texture maps are incomplete due to self-occlusion, we use the I-Net to fill in the missing areas. Specifically, we introduce adversarial training for this task.

Generator module
We employ the generator from Ref. [49] as our backbone network, but make several modifications. Firstly, we employ skip connections between coarse layers of the encoder and decoder to preserve high-frequency details. As 10% of the above training data are half-occluded (i.e., have a yaw of 90°), their U-V texture maps require about half of their area to be inpainted. Thus, we adopt dilated convolutions in the coarse-resolution layers, following Ref. [50], to enlarge the receptive fields and provide better completion. To make our inpainted U-V texture maps more visually realistic, we superimpose a Gaussian noise map [51] on the output of each convolutional layer. We further concatenate the U-V texture map with its flipped version to form a six-channel input. We find that such a flipped U-V texture map provides a better initialization for missing texture areas than random noise. The architecture of the generator is shown in Fig. 3. An L1 loss function is applied to train the I-Net:

$$L_{\mathrm{rec}} = \left\| G(T_{in}, T_{fin}) - T_{gt} \right\|_1 \tag{3}$$

where $G$ is the generator, and $T_{gt}$, $T_{in}$, and $T_{fin}$ are the ground-truth, input, and flipped input U-V texture maps, respectively.
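As a small illustration of the six-channel input described above, the flipped texture can be concatenated along the channel dimension; this is a sketch under the assumption of an NCHW tensor layout, with illustrative names.

```python
import torch

def make_generator_input(uv_texture):
    """Concatenate a U-V texture map with its horizontal flip (6 channels).

    uv_texture: (B, 3, H, W) incomplete U-V texture map, zero in occluded
    regions. Because facial texture is roughly symmetric, the mirrored map
    gives missing areas a better initialization than random noise.
    """
    flipped = torch.flip(uv_texture, dims=[3])       # mirror along the U axis
    return torch.cat([uv_texture, flipped], dim=1)   # (B, 6, H, W)
```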
To preserve content, we also adopt the widely used perceptual loss, defined as

$$L_{\mathrm{perc}} = \left\| V\!\left(G(T_{in}, T_{fin})\right) - V(T_{gt}) \right\|_1 \tag{4}$$

where $V(\cdot)$ is the activation map of a pre-trained deep model. We apply the VGGFace2 pre-trained model [52] as the feature extractor, and use pool3 as the activation map. We also tried an ImageNet pre-trained model [53], but did not see any improvement.

Discriminator module
Although we can obtain a complete U-V texture map using the generator alone, the resulting texture lacks detail.
To improve the quality of the inpainted texture, we introduce adversarial learning. The loss function of the discriminator is as follows:

$$L_{\mathrm{adv}} = \mathbb{E}_{T\sim P}\left[\log D(T)\right] + \mathbb{E}_{T\sim Q}\left[\log\left(1 - D(T)\right)\right] \tag{5}$$

where $P$ and $Q$ represent the distributions of ground-truth U-V texture maps and of generator outputs, respectively. To train the discriminator $D$, we maximize this objective; the generator is trained to fool the discriminator by minimizing it.

Stable training
To train the I-Net stably, we enforce a gradient penalty regularization term [54] in the discriminator:

$$L_{\mathrm{gp}} = \lambda\,\mathbb{E}_{T_{gt}\sim P}\left[\left\| \nabla D(T_{gt}) \right\|_2^2\right] \tag{6}$$

where $\nabla D(T_{gt})$ is the gradient of the discriminator with respect to its input $T_{gt}$.
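For concreteness, a single discriminator update combining the adversarial loss of Eq. (5) with the gradient penalty of Eq. (6) might look as follows. This is only a sketch: the logistic (softplus) form and the penalty weight `gp_weight` are assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def discriminator_step(D, t_real, t_fake, gp_weight=10.0):
    """One discriminator update: adversarial loss plus a gradient penalty
    evaluated on real U-V texture maps (cf. Eqs. (5) and (6)).

    D:      discriminator network
    t_real: ground-truth U-V texture maps, (B, 3, H, W)
    t_fake: generator outputs, (B, 3, H, W)
    """
    t_real = t_real.detach().requires_grad_(True)
    d_real, d_fake = D(t_real), D(t_fake.detach())

    # Non-saturating logistic adversarial loss for the discriminator.
    loss_adv = F.softplus(-d_real).mean() + F.softplus(d_fake).mean()

    # Gradient penalty: penalize the gradient norm of D at real samples.
    grad = torch.autograd.grad(d_real.sum(), t_real, create_graph=True)[0]
    loss_gp = gp_weight * grad.pow(2).sum(dim=[1, 2, 3]).mean()

    return loss_adv + loss_gp
```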

S-Net overview
To reconstruct 3D face models, we first train a coarse-shape subnetwork to obtain coarse (i.e., smooth) shapes, using the same 690k inputs as for the PNCC subnetwork and the 32k ground-truth U-V position maps described in Section 3.6.5. Then, we predict local surface structures by training a fine-shape subnetwork with the 32k ground-truth U-V texture maps and position maps.

Coarse-shape subnetwork
We use an encoder-decoder structure to transfer an RGB face image into a U-V position map. The encoder begins with one convolution layer followed by six convolution blocks, each of which includes a leaky ReLU activation layer, a convolution layer, and a batch normalization layer; together these reduce the 128×128×3 input image to a 1×1×512 feature map. The decoder contains ten transposed convolution blocks, each of which includes a ReLU activation layer, a transposed convolution layer, and a batch normalization layer, and predicts the 84×84×3 U-V position map. We also tried the FCN network from PRNet [10], but it did not converge.
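The two block types described above can be sketched as below; the kernel sizes, strides, and LeakyReLU slope are assumptions, since the text specifies only the layer ordering.

```python
import torch.nn as nn

def enc_block(in_ch, out_ch):
    """Encoder block: leaky ReLU -> strided conv -> batch norm
    (halves the spatial resolution)."""
    return nn.Sequential(
        nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
    )

def dec_block(in_ch, out_ch, stride=2):
    """Decoder block: ReLU -> transposed conv -> batch norm
    (upsamples when stride > 1)."""
    return nn.Sequential(
        nn.ReLU(inplace=True),
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
    )
```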
To learn the parameters of our network, we use the reconstruction loss defined as

$$L_{\mathrm{coarse}} = \left\| S_p - S_g \right\|_1 \tag{7}$$

where $S_g$ is the ground-truth U-V position map, and $S_p$ is the output of the coarse-shape subnetwork.

Fine-shape subnetwork
In our work, we reconstruct detailed 3D face structures by predicting a single-channel U-V displacement map, which represents the displacement of the U-V position map along the z-axis. The U-V displacement map is added directly to the z-axis channel of the U-V position map to generate the final face geometry. As in previous work [55], we assume faces to be Lambertian surfaces, and apply a shape-from-shading (SfS) technique to capture fine details. Such an approach has been demonstrated to be an effective way of recovering high-frequency structural details from RGB face images [17,56,57]. The core of the SfS-based refinement method is an image formation term linking the final face geometry and the input U-V texture map:

$$\hat{I} = R \sum_{i=1}^{9} l_i H_i(N) \tag{8}$$

where $\hat{I}$ is the reflected irradiance, $R$ and $l$ are the albedo map and lighting coefficients estimated by SfSNet [58], and $H_i(N)$ are the spherical harmonic (SH) basis functions computed from the unit normals $N$ of the final face geometry. It is then straightforward to measure the dense photometric discrepancy between the input U-V texture map $I$ and the reconstruction $\hat{I}$ with the MSE loss:

$$L_{\mathrm{sh}} = \left\| \hat{I} - I \right\|_2^2 \tag{9}$$

where $\|\cdot\|_2$ denotes the $\ell_2$ norm. To prevent face shape degeneration, we add a commonly used two-norm regularization term $L_{\mathrm{regu}}$ on the output U-V displacement map. The loss weights of $L_{\mathrm{regu}}$ and $L_{\mathrm{sh}}$ are 1 and 100, respectively. For the fine-shape subnetwork, we employ the same architecture as the PNCC subnetwork, since its skip links can transfer detailed information from the RGB domain to the U-V position domain.
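A minimal NumPy sketch of the image formation term of Eq. (8) is given below, assuming the standard nine-term second-order SH basis; the exact basis convention and normalization used in the paper may differ.

```python
import numpy as np

def sh_basis(normals):
    """Second-order spherical harmonic basis H_i(N), i = 1..9,
    evaluated at unit normals of shape (..., 3); returns (..., 9)."""
    nx, ny, nz = normals[..., 0], normals[..., 1], normals[..., 2]
    return np.stack([
        0.282095 * np.ones_like(nx),          # constant band
        0.488603 * ny,                        # linear band
        0.488603 * nz,
        0.488603 * nx,
        1.092548 * nx * ny,                   # quadratic band
        1.092548 * ny * nz,
        0.315392 * (3.0 * nz ** 2 - 1.0),
        1.092548 * nx * nz,
        0.546274 * (nx ** 2 - ny ** 2),
    ], axis=-1)

def render_irradiance(albedo, light, normals):
    """Image formation term of Eq. (8): I_hat = R * sum_i l_i H_i(N).

    albedo:  (H, W) albedo map R
    light:   (9,) SH lighting coefficients l
    normals: (H, W, 3) unit normals of the final face geometry
    """
    shading = sh_basis(normals) @ light       # (H, W) scalar shading
    return albedo * shading                   # reflected irradiance I_hat
```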

Training data overview
Training data plays an essential role in deep-learning-based computer vision algorithms. For our task, it is difficult and tedious to directly collect real face images together with their ground-truth 3D shapes. Existing public 3D face datasets, e.g., BU4DFE [59] and Bosphorus [60], are limited to constrained environments and a few hundred subjects, which are insufficient for training our SRTC-Net.
To address this problem, we compile our training data from three parts; some come from previous work and others are hand constructed. The first part contains image-mask pairs for the face segmentation network of the C-Net. The second one contains PNCC-image pairs to train the PNCC prediction network of the C-Net and S-Net, while the last one comprises complete U -V texture maps and U -V position maps to train the I-Net and S-Net.

Registered 3D face models
Most of our data generation relies on two kinds of 3D face models, so we start by introducing how we create them. We generate these face models so that they share the same number of vertices and a uniform topology. Specifically, there are two kinds of registered 3D face models: the mean model and the training models. For the former, we use the template model from the MeshMonk toolbox [61] as the source model, and the template model of Ref. [11] as the target model; we then employ MeshMonk to perform an automated non-rigid registration to obtain the mean model. To acquire the training models, we non-rigidly register the template model of Ref. [61] to the texture-symmetry point clouds described in Section 3.6.4. In total, we produce 32k training models.
The mean model is only used in face segmentation data generation (see Section 3.6.3). The training models serve as the ground-truth 3D shapes for the S-Net, and are also used to generate PNCC-image pairs in Section 3.6.4.

C-Net face segmentation module data
We first collect two kinds of images from the Internet according to face pose: one set has 60k near-frontal face images, and the other has 7k face images with large yaw.
In order to automate the segmentation procedure on these near-frontal face images, we manually specify 60 3D landmarks on the boundary of the mean model. Then, we specify 60 counterpart 3D landmarks on the template model of Ref. [11] by finding the nearest vertex to each of the former 60 landmarks. For each of the 60k near-frontal face images, we get the predicted 3D model using the approach in Ref. [11], and project the 60 3D landmarks described above into the image plane. Then, we find the convex hull of these 60 projected landmarks. Pixels inside the convex hull are foreground, and the others are background.
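This mask generation step amounts to rasterizing the convex hull of the 60 projected landmarks; a sketch with OpenCV follows, with illustrative names.

```python
import cv2
import numpy as np

def mask_from_landmarks(landmarks_2d, height, width):
    """Binary foreground mask from the convex hull of the 60 projected
    boundary landmarks.

    landmarks_2d: (60, 2) projected landmark coordinates (x, y)
    """
    mask = np.zeros((height, width), dtype=np.uint8)
    hull = cv2.convexHull(landmarks_2d.astype(np.int32))  # (K, 1, 2) hull points
    cv2.fillConvexPoly(mask, hull, 1)   # pixels inside the hull are foreground
    return mask
```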
For the 7k collected face images with large yaw (from 30° to 90°), we extract facial profile contours manually; pixels inside the contours constitute the foreground. We then apply data augmentation methods such as in-plane rotation, flipping, and random cropping to increase the number of large pose-variation image-mask pairs to 70k. The final face segmentation training data comprises 130k image-mask pairs.

C-Net PNCC prediction module data
We select 32k image-mask pairs from the 130k training data described in Section 3.6.3, and obtain their camera parameters and 3D models following Ref. [11]. To copy the pixel values of 2D faces onto the 3D models, we project vertices into the image plane with the predicted camera parameters, and assign image pixel values to the corresponding vertices. In this process, projected vertices which lie in the background or belong to self-occluded areas have their texture values set to zero. For each of the above textured models, we then fill each zero-valued vertex with the texture of its symmetric counterpart; we refer to the resulting model as a texture-symmetry point cloud.
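A sketch of this symmetric filling step is shown below; it assumes a precomputed index map `sym_index` pairing each vertex with its bilateral mirror, which the text does not describe explicitly.

```python
import numpy as np

def texture_symmetry_fill(vertex_colors, sym_index):
    """Fill untextured (zero-valued) vertices with the texture of their
    bilaterally symmetric counterparts.

    vertex_colors: (N, 3) per-vertex texture, zeros where unobserved
    sym_index:     (N,) index of each vertex's mirror counterpart (assumed)
    """
    missing = (vertex_colors == 0).all(axis=1)           # untextured vertices
    filled = vertex_colors.copy()
    filled[missing] = vertex_colors[sym_index[missing]]  # copy mirrored texture
    return filled
```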
We warp the template model of Ref. [61] to all texture-symmetry point clouds, obtaining 32k textured 3D face models altogether.
Given the 32k complete textured 3D face models, we generate the normalized coordinate code (NCC) [47] for each one, and set the texture value of each vertex to its NCC value. Next, we project each complete textured 3D face model, together with its corresponding NCC-textured model, into the image plane at poses at uniform intervals (yaw from −90° to 90° in 10° steps), giving 690k PNCC-image pairs for PNCC subnetwork training. All images of the 690k PNCC-image pairs are also used to train the coarse-shape subnetwork of the S-Net.

I-Net U-V texture inpainting and S-Net U-V position prediction data
To unwrap a 3D face model to the U-V plane, we need a parametric model of the 3D face. We thus use a conformal parametrization method [62] to unwrap the template 3D face model of the MeshMonk toolbox [61] to a square map. Our template 3D model contains 7052 vertices. Although the resolution of the U-V map can be arbitrarily defined, to reduce the computational cost, we set it to 84 × 84. Given the 32k complete textured 3D face models described in Section 3.6.4, we generate 32k ground-truth U-V texture maps and U-V position maps to train the I-Net and S-Net separately.
The training data described in Section 3.6 is summarized in Table 1.

Experiments
In this section, we first evaluate landmark localization on the Multi-PIE dataset to compare our C-Net with current state-of-the-art methods. Then we assess the I-Net on the two most widely used pose-invariant face recognition datasets [36,63] by frontalizing faces with large pose variations. Finally, we compare our S-Net against currently popular 3D face reconstruction methods, both qualitatively and quantitatively. Throughout this section, to detect five keypoints of extreme-pose faces for alignment, we employ the OpenFace 2.0 toolbox [64].

Evaluation of C-Net
Since there is no publicly available dataset for comparing C-Net with previous methods on dense correspondence between 2D images and 3D models, we instead measure the sparse landmark localization error on face images with varying poses. Visible landmarks on the face contour are significant indicators, as they form the boundary for U-V texture mapping. The Multi-PIE [63] dataset contains face images with large pose variations, annotated with 39 visible landmarks; from it we select images with extreme poses [±60°, ±90°] from Session 1 as our test images. We discard 12 landmarks in areas not covered by our mesh and measure the remaining 27 landmarks inside the face profile on 256 images. We compare C-Net to the 3DDFA method [47], PRNet [10], and Deep3DFace [11]. For a fair comparison, we adopt the normalized mean error (NME) as our evaluation metric; the average landmark error is normalized by the nose length (the distance from Point 1 to Point 4, using Multi-PIE landmark labels). Table 2 shows that our method outperforms previous methods. We give various profile landmark localization results in Fig. 4, demonstrating that C-Net can capture an accurate correspondence between a 2D face image and a 3D model even for extreme poses.
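For reference, the NME metric can be computed as in the sketch below; `nose_top` and `nose_tip` stand in for Multi-PIE landmark points 1 and 4.

```python
import numpy as np

def nme(pred, gt, nose_top, nose_tip):
    """Normalized mean error over visible landmarks.

    pred, gt: (K, 2) predicted and ground-truth 2D landmarks
    nose_top, nose_tip: indices of the two landmarks whose distance
    (the nose length) serves as the normalization factor
    """
    norm = np.linalg.norm(gt[nose_top] - gt[nose_tip])
    return np.linalg.norm(pred - gt, axis=1).mean() / norm
```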

Correspondence and texture inference
Previous methods build correspondences between a 2D facial image and a 3D model via low-dimensional 3D face reconstruction. We decompose the problem of building 2D-to-3D correspondences into two more manageable sub-tasks: face segmentation and PNCC prediction. We give a qualitative comparison of texture mapping from 2D images to 3D models between our method, PRNet [10], and Deep3DFace [11]. As shown in Fig. 5, the texture mapping results of PRNet and Deep3DFace contain non-facial areas that can be removed neither by their invisibility masks, computed using standard Z-buffering [69,70], nor by image inpainting algorithms. Our C-Net, however, builds more accurate correspondences between input faces and 3D models, resulting in better texture inference.

Pose-invariant face recognition with I-Net
To evaluate the effectiveness of the I-Net, we apply it to pose-invariant face recognition (PIFR) by projecting the reconstructed textured 3D faces into the image plane at frontal view. We use the Celebrities in Frontal-Profile (CFP) [36] and CMU Multi-PIE [63] datasets. Specifically, we fuse the feature vectors of each probe and its frontalized version, extracted by a face network, and then compute the cosine distance between the fused probe feature and each gallery feature as the similarity measure.
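A sketch of this fusion-and-matching step follows; additive fusion of the two features is an assumption, as the fusion operator is not specified above.

```python
import numpy as np

def fused_similarity(feat_probe, feat_frontalized, feat_gallery):
    """Fuse a probe feature with the feature of its frontalized version,
    then score against a gallery feature by cosine similarity."""
    fused = feat_probe + feat_frontalized   # simple additive fusion (assumed)
    fused = fused / np.linalg.norm(fused)
    gallery = feat_gallery / np.linalg.norm(feat_gallery)
    return float(fused @ gallery)           # cosine similarity
```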

CFP dataset
CFP is a challenging dataset for PIFR algorithms, and here we focus on its most challenging task: frontal-profile matching. It contains 500 identities, each with ten frontal and four profile faces. The whole dataset provides 7000 pairs for the task: 3500 same-identity and 3500 different-identity pairs. For the evaluation, we employed two advanced face networks, ResNet50 and ResNet50-Dream (ResNet50 with an embedded Dream block), proposed in Ref. [68], as our feature extractors. The naive ResNet50 model in Ref. [68] already achieved an accuracy of 97% on frontal-profile matching, which is higher than UV-GAN (94.05%) [3] and ArcFace (95.56%) [71].
In this section, we evaluate the face recognition performance of our two incorporated pipelines and other methods. As shown in Table 3, although the two face networks, ResNet50 and ResNet50-Dream, both obtain impressive results on CFP in the frontal-profile setting, our SRTC-Net still improves their results considerably. Example frontalization results are shown in Fig. 6. These results demonstrate that our method can infer complete facial textures robustly despite significant variations in pose, expression, and illumination in an unconstrained environment.

Multi-PIE database
The CMU Multi-PIE [63] dataset is the largest controlled multi-view face recognition database. It was collected in four sessions; for evaluation, we use Session 1, which includes faces of 250 subjects. Following the two most popular test protocols in Refs. [30,74], we select images of the last 100 subjects for our evaluation. The gallery set comprises one frontal face image per subject under normal illumination. For the probe sets, we only consider challenging faces whose poses are larger than 45°. Altogether, there are 100 gallery images and 16,000 probe images.
We employ the pre-trained Light CNNv2 model [75] as the face feature extractor. As shown in Table 4, although this powerful model achieves nearly perfect accuracy for non-extreme poses, our method still improves upon it for all poses; the accuracy increases by about 10% for full side views. We also exhibit face frontalization results for images from the Multi-PIE dataset in Fig. 7.

Evaluation of S-Net
To validate the efficacy of our S-Net, we assess it on the MICC Florence 3D Face dataset [79]. MICC contains 53 subjects, each of whom has a 3D scanned model and three videos captured in cooperative, indoor, and outdoor scenarios. We compare our method with several state-of-the-art image-based 3D face reconstruction methods [10,11,47,72] in the indoor scenario. We crop the ground-truth face scans to 100 mm around the nose tip and remove the forehead region of the scans, following Ref. [11].
In this experiment, we extracted an image every ten frames from each video, and used a total of 7895 frames from videos of 53 subjects.
For 3DDFA (https://github.com/cleardusk/3DDFA), VRN (https://github.com/AaronJackson/vrn), PRNet (https://github.com/YadiraF/PRNet), Deep3DFace (https://github.com/microsoft/Deep3DFaceReconstruction), and our S-Net, we reconstructed an average shape from all frames of each subject to evaluate the error. To compute error distances, we used a publicly available evaluation tool (https://github.com/patrikhuber/fg2018-competition). Specifically, we employ 7 points: the inner and outer eye corners of both eyes, the nose tip, and the mouth corners, to rigidly align the predicted mesh with the ground truth. The distance error for each ground-truth vertex is the distance between the vertex and the closest point on the surface of the predicted mesh. For comparison, we use the root mean square error (RMSE) to measure the difference between the predicted mesh and the ground-truth mesh. Table 5 shows that our reconstruction performance is close to that of Ref. [11] and better than those of Refs. [10,47,72]. Various qualitative results on the MICC dataset are shown in Fig. 8. Additionally, we display a qualitative comparison for some in-the-wild images in Fig. 9, which shows that our method provides better geometry fitting and better preservation of facial identity details. To make our framework computationally efficient, we restrict the resolution of the texture map to 84 × 84, which may limit reconstruction detail. Thus, we only compare our method to low-dimensional methods, to show that our completed U-V texture can provide detailed structures beyond 3DMM methods.
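As a sketch of this evaluation protocol, the 7-point rigid alignment and the point-to-surface RMSE might be implemented as follows; trimesh is used here for the closest-point query as an assumption, whereas the evaluation above used the tool linked in the text.

```python
import numpy as np
import trimesh

def rigid_align(src_pts, dst_pts):
    """Least-squares rigid transform (R, t) mapping src_pts onto dst_pts
    (Kabsch algorithm over the 7 annotated landmarks)."""
    mu_s, mu_d = src_pts.mean(axis=0), dst_pts.mean(axis=0)
    U, _, Vt = np.linalg.svd((src_pts - mu_s).T @ (dst_pts - mu_d))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:      # avoid an improper rotation (reflection)
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, mu_d - R @ mu_s

def reconstruction_rmse(pred_mesh, gt_vertices, pred_lm, gt_lm):
    """RMSE of ground-truth vertices to the surface of the aligned
    predicted mesh (pred_mesh: trimesh.Trimesh)."""
    R, t = rigid_align(pred_lm, gt_lm)
    aligned = trimesh.Trimesh((R @ pred_mesh.vertices.T).T + t,
                              pred_mesh.faces, process=False)
    # Distance from each ground-truth vertex to the closest surface point.
    _, dist, _ = trimesh.proximity.closest_point(aligned, gt_vertices)
    return float(np.sqrt((dist ** 2).mean()))
```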

Conclusions and limitations
This paper has proposed the deep SRTC-Net to attack the challenging problem of inferring 3D facial geometry and texture from a single image. Our method consists of three subnetworks, the C-Net, I-Net, and S-Net, which decompose the hard problem into several more tractable problems. The C-Net builds accurate correspondences between 2D face images and 3D template models with two pixel-wise prediction subnetworks. Our carefully designed I-Net completes the missing areas of U-V texture maps. The reconstructed U-V texture maps are applied in the S-Net to extract detailed surface structures. Specifically, the S-Net first reconstructs a coarse shape (U-V position map), and then regresses 3D surface details (a U-V displacement map) from the completed U-V texture maps pixel-by-pixel. A comparison between our C-Net and state-of-the-art face reconstruction methods on the Multi-PIE dataset demonstrates the effectiveness of C-Net. For 3D facial geometry reconstruction, we achieve quantitative performance better than or close to that of state-of-the-art deep-learning-based methods on the MICC dataset. Furthermore, our methods achieve better qualitative results than previous low-dimensional methods on unconstrained images. Our face recognition results on the CFP and Multi-PIE datasets demonstrate that our method can also significantly improve the performance of face recognition models on the PIFR task. The subnetworks of our framework are trained separately, which makes our method somewhat complicated. Although symmetry is a reasonable prior for a facial texture map, it may not be appropriate for some facial details: e.g., a crooked smile can break the symmetry of facial texture. The cascading structure of our framework means that error in each sub-module is transmitted to the next stage. Our work cannot provide accurate PNCC for face areas occluded by non-facial objects, such as a hat. In the future, we hope to make our framework more easily trainable and more robust to extreme expressions.