Sphere Face Model: A 3D morphable model with hypersphere manifold latent space using joint 2D/3D training

3D morphable models (3DMMs) are generative models for face shape and appearance. Recent works impose face recognition constraints on 3DMM shape parameters so that the face shapes of the same person remain consistent. However, the shape parameters of traditional 3DMMs follow a multivariate Gaussian distribution, whereas identity embeddings are distributed on a hypersphere. This conflict makes it challenging for face reconstruction models to preserve shape faithfulness and shape consistency simultaneously. In other words, the recognition loss and the reconstruction loss cannot decrease jointly because their target distributions conflict. To address this issue, we propose the Sphere Face Model (SFM), a novel 3DMM for monocular face reconstruction that preserves both shape fidelity and identity consistency. The core of our SFM is the basis matrix, which is used to reconstruct 3D face shapes. The basis matrix is learned with a two-stage training approach in which 3D and 2D training data are used in the first and second stages, respectively. We design a novel loss to resolve the distribution mismatch, enforcing a hyperspherical distribution on the shape parameters. Our model accepts both 2D and 3D data for constructing the sphere face models. Extensive experiments show that SFM has high representation ability and that its shape parameter space clusters well. Moreover, it consistently produces high-fidelity face shapes under challenging conditions in monocular face reconstruction. The code will be released at https://github.com/a686432/SIR


Introduction
The problem of face reconstruction from images and videos has attracted considerable attention in the computer vision and computer graphics communities. It has a broad range of applications, including AR/VR [1], animation [2,3], and computer games [4]. In recent years, there has been a growing demand for customizing 3D virtual faces to create game characters [4,5] or for personalized 3D facial editing [6]. In such applications, images from ordinary users come from highly diverse conditions in terms of occlusion, resolution, pose, expression, illumination, etc. It is thus challenging to reconstruct a face from only a single image while achieving both shape faithfulness and identity preservation.
Although previous works [7,8] claimed to achieve face reconstruction from a single image, their reconstructed face shapes suffer from inconsistent identity properties when the input images are captured under varying conditions. To address this problem, follow-up works [9][10][11] propose to aggregate the shape parameters of the same identity while separating those of different subjects, producing 3D face shapes with good identity-related features. However, the conflict between the shape loss and the identity loss in their reconstruction pipelines prevents them from achieving both shape fidelity and identity consistency. This conflict comes from the mismatch between the distribution of the identity embeddings of face recognition and that of the shape parameters of previous 3D morphable models (3DMMs) [12][13][14][15], which maximize model expressiveness while neglecting some information that distinguishes categories.
Therefore, this paper focuses on identity-consistent face reconstruction with a linear model. To resolve the aforementioned distribution mismatch, we propose a novel face generation model called the Sphere Face Model (SFM). We add category information while building the basis of SFM and constrain the identity parameters to a hypersphere by normalizing the shape parameters, which makes the shape parameter space of SFM consistent with the identity latent space (the geometric space, shape parameter space, and identity latent space are defined in Section 3). In this way, we resolve the conflict between the two losses and further improve the identifiability of 3D face models. Moreover, SFM has the essential property that the discrimination of the parameters is transferable to the geometry: the Euclidean distance between two sets of 3DMM parameters in the shape parameter space is positively correlated with the distance between the corresponding mesh vertices in the geometric space. One notable challenge is that when the identity parameters are forced onto a hyperspherical surface, the L2 norms of all parameter vectors become the same. In other words, all reconstructed faces would have the same root mean square error from the mean face, which reduces the variety of generated faces. We use two approaches to address this issue. Algorithmically, we add a parameter that controls the scale of the shape parameters of each face. On the data side, while previous approaches mainly use 3D training data, which are limited, we propose a two-stage training approach in which 3D data are used only for pre-training and an unsupervised learning stage leverages a sufficient amount of 2D face data. Figure 1 highlights the differences between our face model and previous 3DMMs. The parameters of SFM consist of a shape parameter and a scale parameter. The identity parameter is the normalized shape parameter, which controls the identity attribute of the face; it is distributed on the hypersphere with good separation properties. The scale parameter controls the distance to the average face.
The main contributions of this paper lie in the following three aspects:
• We propose the Sphere Face Model (SFM) for 3D face reconstruction from single images with both shape faithfulness and identity consistency.
• We propose a new structure of 3DMMs, where the shape parameter space follows a hyperspherical distribution and the discrimination of the shape parameter space is transferable to the geometric space.
• To enable SFMs to reconstruct high-quality 3D face models from single images, we present a learning scheme to train SFMs with both 2D and 3D data.

Related work
3D morphable models map the high-dimensional face geometry space to a low-dimensional manifold space. Based on 3DMMs, previous works on monocular face reconstruction optimize the low-dimensional 3DMM parameters from the input image to reconstruct high-dimensional face geometry. Meanwhile, many works introduce an identity loss into the face reconstruction pipeline to keep the face shape stable across varying input images. This section introduces the related work from three aspects: 3D morphable models, shape-consistent face reconstruction from monocular images, and deep face recognition.
3D morphable models. A 3D morphable model is a statistical model of the distribution of faces, which maps a low-dimensional parameter vector to high-dimensional mesh vertices. The groundbreaking work on 3DMMs traces back to Blanz and Vetter [16], who build the 3D morphable model by applying principal component analysis to a set of 200 example 3D faces. Based on this idea, Paysan et al. [12] provide the first public 3DMM, BFM 2009, and others [17][18][19][20][21][22] extend the model to incorporate expressive facial shape information by adopting an additional expression basis or by using bilinear and multilinear models. Ref. [13] provides a whole-head model, FLAME, which introduces an articulated jaw, neck, and eyeballs in a linear shape space together with global expression to make the model more expressive. Yang et al. [23] present a large-scale detailed 3D face dataset and use it to model the variation of detailed geometry. Unlike these works, we consider identity information while constructing the 3DMM, so that the shape parameters are inherently separated by identity. Blanz and Vetter [16] only use facial meshes of 200 subjects of similar ethnicity and age, which cannot represent the great diversity of human faces. Ref. [15] trains the 3DMM on large-scale 3D data to overcome this limitation, but 3D data remain limited. Refs. [24][25][26] use abundant 2D data to train the 3DMM. However, training with 2D data without a 3D prior requires strong regularization terms, which leads to a lack of geometric detail and diversity. Our training method makes full use of both 2D and 3D data. In recent years, with the development of deep learning, Refs. [24,26,27] propose nonlinear models with an encoder-decoder structure. Those nonlinear models consider neither parameter separation nor the propagation of discrimination from the shape parameter space to the geometric space during training.
Shape-consistent monocular face reconstruction. Early works [28][29][30][31][32] reconstruct 3D faces from monocular RGB images using the analysis-by-synthesis approach with the prior knowledge of a 3DMM. They often enforce photometric and landmark consistency between the input and the rendered image. In recent years, many studies [9,33-36] have proposed deep networks to regress the 3DMM parameters. Adversarial loss, perceptual loss, and identity loss on the rendered image [26,37-40] are proposed to generate high-fidelity textures. However, applying a face recognition loss to the rendered image mainly affects the recognizability of the texture and has a relatively small impact on shape-consistent reconstruction. Feng et al. [41] swap the shape parameters between images of the same person and employ the photometric and identity losses on the rendered images. However, this approach fails to distinguish the shape parameters of different people. To reconstruct stable face geometry, Tran et al. [10] label a large number of face images with 3DMM shape parameters using an optimization method and use a deep CNN to learn the mapping from images to shape parameters, but the performance depends on the accuracy of the optimization method. Liu et al. [11] and Sanyal et al. [9] use a face recognition loss to push apart the shape parameters of different people while aggregating those of the same person. Jiang et al. [42] argue that simply applying the face recognition loss to the shape parameters does not guarantee shape consistency. They explore the relationship between shape parameter discrimination and geometric visual discrimination and propose the SIR loss, which increases discriminability in both the shape parameter and shape geometry domains. Since they use a PCA-based face model, it remains challenging to preserve faithfulness and shape consistency simultaneously.
Deep face recognition. In recent years, many works have achieved remarkable face recognition accuracy with powerful deep convolutional neural networks. Most of them focus on cleaning and mining the training data or on designing loss functions that minimize the intra-class distance and maximize the inter-class distance, which boosts the discrimination of the deep identity embedding. There are mainly three types of loss functions for face recognition. The first uses a pair or triplet training strategy, such as the contrastive loss [43] and the triplet loss [44]. The second, like the center loss [45], acts as an auxiliary loss that augments other loss functions. These losses aim to aggregate features so as to minimize the intra-class distance; they can be added directly to the classifier network to learn discriminative features. The last type is the modified softmax [46][47][48][49][50][51]. NormFace [47] and Cocoloss [48] normalize the weights and features and directly optimize the cosine similarity instead of the inner product. L-softmax [46] and SphereFace [49] introduce the multiplicative angular margin. CosFace [51] and Am-softmax [52] introduce the additive cosine margin, and ArcFace [50] introduces the additive angular margin. Refs. [53][54][55] adapt the margin during training. Current state-of-the-art deep face recognition methods mostly adopt this last type, the softmax-based classification loss, and their identity latent space is a hypersphere.

Space distribution
This section elaborates on the characteristics that the identity latent space needs for effective face representation and reconstruction. Before introducing our method, we first introduce the terminology and several key concepts: geometric space, shape parameter space, and identity latent space. The geometric space $\Psi$ is a set of face meshes, formulated as $\Psi \in \mathbb{R}^{N_v}$, where $N_v$ is the number of vertices of a face mesh. The shape parameter space $\Phi$ is a set of shape parameters of a 3DMM, formulated as $\Phi \in \mathbb{R}^{N_p}$, where $N_p$ is the dimension of the shape parameters. The identity latent space $\Omega$ is a set of identity embeddings, formulated as $\Omega \in \mathbb{R}^{N_i}$, where $N_i$ is the dimension of the identity embedding.
PCA-based 3DMMs suffer from a conflict between the losses for face recognition and reconstruction in the shape-consistent face reconstruction pipeline. Specifically, the shape parameters for face reconstruction follow an anisotropic multivariate Gaussian distribution [16]:
$$\alpha \sim \mathcal{N}(\mathbf{0}, \Sigma) \tag{1}$$
where $\alpha$ is the shape parameter, $\Sigma = \mathrm{diag}(e_1, e_2, \ldots, e_n)$, and $e_i$ is the $i$th eigenvalue of the shape basis. However, these eigenvalues vary significantly ($e_1 : e_{199} \approx 400$), making the distribution a hyper-ellipsoid with a high eccentricity, as shown in Fig. 1.
In contrast, the identity embeddings for face recognition are distributed isotropically on a hypersphere, as first introduced in Ref. [47], and modern face recognition methods [50,53,54] follow this distribution of the identity embedding $\beta$. The mismatch between the distribution of the shape parameter space in face reconstruction and that of the identity latent space in face recognition makes it very difficult for the two loss functions (face recognition loss and face reconstruction loss) to converge together. More specifically, under a strong face recognition loss, the latent vectors are forced onto a hyperspherical surface, which does not follow the actual distribution of the shape parameters and makes the reconstruction results inaccurate. Conversely, a strong reconstruction loss would likely make the distribution of the latent vectors no longer hyperspherical, resulting in less identity-consistent reconstructions. Note that nonlinear face models [24,26,27], which also belong to the family of 3DMMs, are not guaranteed to transfer the discrimination of the shape parameter space to the geometric space, as explained in Section 3.2, and thus cannot preserve identity information while constructing face models. To address the above issue, we propose to keep the shape parameter space of SFMs consistent with the identity latent space of face recognition. The significant difference between our method and previous methods [13,14] is that the latent space of our 3DMM is a hypersphere with an isotropic distribution, whereas that of previous 3DMMs is a hyper-ellipsoid with a large eccentricity. Additionally, the model should meet the requirement that discriminability can be transferred between the shape parameter space and the geometric space. Here, we first introduce the latent space distribution of identity embeddings and then describe how we design the structure of SFMs and the concrete constraints that the SFM should satisfy.

Hypersphere manifold of identity embedding
Modern face recognition works commonly adopt a softmax-based classification loss for metric learning, where the weights $w$ and identity embeddings $l$ are normalized and the concept of margin [49][50][51] is adopted to further boost the discrimination of deep face features. In particular, a loss function with margin can be formulated as Eq. (3), where $x_i$ denotes the deep feature of the $i$th sample and $y_i$ denotes the label of $x_i$. $w_j$ is the $j$th column of the normalized weight before the softmax [47]. $s$ re-scales the cosine embedding. $\theta_j$ is the angle between the vector $x_i$ and the class vector $w_j$ in the identity latent space. $m$ and $n$ denote the batch size and the number of classes, respectively. The parameters $\alpha$, $\beta$, and $\gamma$ encode margins of different kinds (see SphereFace [49], CosFace [51], and ArcFace [50]). The identity embedding trained with a softmax-based classification loss is distributed on a hypersphere. Previous works [11,42] impose the softmax-based loss on shape parameters. However, shape parameters constrained by the face recognition loss tend toward a hyperspherical distribution, whereas they must follow the distribution of the PCA-based basis (the anisotropic multivariate Gaussian distribution) to reconstruct face shapes well. Therefore, for the face recognition loss to better affect geometric separation, we must build a reconstruction basis whose distribution is similar to that of the identity embedding. However, previous works [13,26] on constructing shape bases did not emphasize this.
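The margin-based softmax described above can be illustrated with a minimal single-sample sketch. The argument names `m_mult`, `m_ang`, and `m_cos` are hypothetical and stand for a multiplicative angular margin (SphereFace-style), an additive angular margin (ArcFace-style), and an additive cosine margin (CosFace-style); how they map onto the symbols of Eq. (3) is an assumption, and the sketch is not the training code of this paper.

```python
import math

def margin_softmax_loss(cosines, label, s=30.0, m_mult=1.0, m_ang=0.0, m_cos=0.0):
    """Margin-based softmax loss for a single sample.

    cosines[j] is cos(theta_j) between the normalized embedding and the
    normalized weight of class j. The three margin arguments are
    hypothetical names (see the text above).
    """
    theta_y = math.acos(max(-1.0, min(1.0, cosines[label])))
    # Simplification: no special handling when the shifted angle exceeds pi.
    target = math.cos(m_mult * theta_y + m_ang) - m_cos
    logits = [s * (target if j == label else c) for j, c in enumerate(cosines)]
    peak = max(logits)  # subtract the maximum for a stable log-sum-exp
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_z - logits[label]

cosines = [0.8, 0.1, -0.3]
plain = margin_softmax_loss(cosines, label=0, s=1.0)
arc = margin_softmax_loss(cosines, label=0, s=1.0, m_ang=0.5)
cos_margin = margin_softmax_loss(cosines, label=0, s=1.0, m_cos=0.35)
```

With all margins disabled the function reduces to the normalized softmax cross-entropy; enabling any margin lowers the target logit and therefore raises the loss for the same sample, which is what pushes classes apart on the hypersphere.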

Shape parameter space of Sphere Face Models
As mentioned above, the established SFM should meet the following criteria: (1) the discriminability of the shape parameter space can be transferred to the discriminability of the geometric space; (2) the distribution of the SFM parameters is consistent with the distribution of the face recognition identity embeddings, that is, the isotropic hyperspherical distribution. For the first criterion, the SFM shape parameter space has to meet the condition in Eq. (4). If the mapping from the shape parameter space to the geometric space is a linear function and the basis (described in Section 4) is orthonormal, this condition is met (the property is proved in Ref. [42]). Thus, we use an orthonormal basis in SFM.
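The role of the orthonormal basis can be checked numerically. The sketch below assumes a generic linear model (mean shape plus basis times parameters) with hypothetical sizes and a random basis obtained by QR factorization; it only illustrates that a basis with orthonormal columns preserves Euclidean distances between parameter vectors, and it is not the learned SFM basis.

```python
import numpy as np

rng = np.random.default_rng(0)
n_coords, n_params = 300, 10  # hypothetical sizes: 3 * vertices, parameter dimension

# Reduced QR factorization yields a matrix with orthonormal columns: A.T @ A = I.
A, _ = np.linalg.qr(rng.standard_normal((n_coords, n_params)))
mean_shape = rng.standard_normal(n_coords)

def linear_decode(x):
    """Map a parameter vector to a flattened mesh with a linear model."""
    return mean_shape + A @ x

x1 = rng.standard_normal(n_params)
x2 = rng.standard_normal(n_params)

d_param = np.linalg.norm(x1 - x2)                               # distance in parameter space
d_geom = np.linalg.norm(linear_decode(x1) - linear_decode(x2))  # distance in geometric space
```

Because the mean shape cancels and the columns of `A` are orthonormal, `d_geom` equals `d_param` up to rounding, so separation achieved in the parameter space carries over to the geometry.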
To meet the second criterion, we normalize the shape parameters in SFM. As a consequence, the vector of shape parameters is constrained to the hypersphere, so that the angle between two parameter vectors is monotonically related to their distance in the geometric space. This also raises a problem: all faces would have the same distance from the average face, since all parameter vectors have the same distance from the origin. Our solution is to add a scalar that controls the norm of the face parameters. Similar to Ref. [56], we use scale-normalized shape parameters, which we call identity parameters since they are related to identity information. The scale parameter represents the difference from the mean face. Previous work [57] also proposes decomposition networks, but their model does not consider the above situations, which makes it impossible to apply a face recognition loss to the shape parameters to further improve parameter separation.
To summarize, our SFM consists of a scale parameter s and a vector of shape parameters x to describe a face model.

Sphere Face Model
Given the shape parameters $x$ and the scale parameter $s$, our Sphere Face Model reconstructs the 3D face shape by
$$M = \bar{M} + A\left(s \cdot \frac{x}{\|x\|}\right) \tag{5}$$
where $M \in \mathbb{R}^{3n}$ is the reconstructed 3D face shape with $n$ vertices and $\bar{M} \in \mathbb{R}^{3n}$ is the mean face shape. The normalized term $x/\|x\|$ represents the identity parameters. The orthogonal matrix $A$ represents the basis of SFM, which is obtained by a joint 2D-3D learning framework based on deep neural networks. This structure guarantees that $s \cdot x/\|x\|$ is located on a hypersphere.
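A minimal sketch of this decoding step follows, assuming the form mean shape plus basis times scaled identity parameters. The function name `sfm_decode`, the sizes, and the random orthonormal basis are hypothetical stand-ins for the learned quantities.

```python
import numpy as np

def sfm_decode(x, s, A, mean_shape):
    """Decode shape parameters x and scale s into a flattened mesh.

    Assumed form: mean_shape + A @ (s * x / ||x||).
    """
    identity = x / np.linalg.norm(x)  # identity parameters on the unit hypersphere
    return mean_shape + A @ (s * identity)

rng = np.random.default_rng(0)
A, _ = np.linalg.qr(rng.standard_normal((30, 5)))  # stand-in orthonormal basis
mean_shape = rng.standard_normal(30)
x = rng.standard_normal(5)

M = sfm_decode(x, 2.5, A, mean_shape)
offset_norm = np.linalg.norm(M - mean_shape)
```

With an orthonormal basis, the distance of the decoded mesh from the mean face equals the scale `s`, and rescaling `x` leaves the mesh unchanged, which is why the scale has to be carried by a separate parameter.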
Previous works on constructing parameterized models mainly rely on either 2D or 3D datasets. However, a model trained only on 3D data lacks face variation because there are no large publicly available 3D face datasets. Training only on 2D data also struggles to give satisfying results, since without 3D shape guidance the large diversity of expressions and poses affects the identity-related features of the reconstructed face models. The regularization constraints used in these methods [24,26] also make the generated meshes similar to the average face. Tran et al. [26] use a proxy strategy to alleviate this issue but do not fully solve it. Therefore, we propose an effective learning scheme that uses both 2D and 3D data to learn face models with the aforementioned properties.
In the following sub-sections, we introduce the overall framework and then describe how the deep model is trained using 3D and 2D face data.

Learning framework
Given the model defined in Eq. (5), our goal is to learn the basis matrix A from face datasets. To achieve this, we adopt a two-stage training strategy, as illustrated in Fig. 2. In the first stage, we feed the model with scale and shape parameters and force it to reconstruct the 3D face. We optimize the basis matrix, scale parameters, and shape parameters by minimizing the objective function in Eq. (9). After this step, we obtain a basis matrix that is still rough due to the scarcity of 3D training data. In the second stage, we make use of large 2D face datasets and train an encoder-decoder style model similar to Refs. [25,26,58]. The pre-trained SFM can be regarded as a decoder module that, together with other decoder modules, reconstructs the 2D face image from the latent vector produced by the encoder. By optimizing the encoder and decoder, our SFM is finetuned.
Fig. 2 Framework of our method. The normalization of x generates the identity parameters, which are distributed on a hypersphere. The normalized identity parameter is multiplied by the scale parameter to get the shape parameter and goes through the basis to get the corresponding mesh. When training on 3D data, we directly optimize s and x. When training on 2D data, we use an encoder-decoder because other parameters are required to render the image.
More specifically, the encoder regresses the scale, shape, expression, and other rendering parameters, such as albedo, illumination, pose, and camera parameters. The decoder has four components, each of which is trained in this stage: (1) the trained shape basis of SFM, (2) the expression basis $D_{exp}$ from BFM2017 [14], (3) the albedo basis $D_{albedo}$ from Ref. [59], and (4) a rendering layer based on PyTorch3D [60], which takes the geometry, albedo, illumination, pose, and camera parameters and renders 224×224 RGB images. The illumination model is a spherical harmonic illumination model.
In previous works, Ref. [26] does not use 3D prior when constructing face models from 2D data; Ref. [25] creates a new basis besides the 3DMM to correct face shape; Ref. [58] directly regresses the residual displacement in geometric space to correct the face shape. In contrast, our work directly corrects the 3D prior basis by decoupling the expression and appearance information in 2D data, which is able to learn better identity-related features for face reconstruction.

Data preparation
3D data. The FRGC v2.0 database [61] contains 4007 3D face scans of 466 subjects and is acquired by a Minolta Vivid 900/910 series sensor under controlled illumination conditions. In the preprocessing, we use a non-rigid iterative closest point algorithm [62] to register the raw 3D face scans to the topology of BFM2017 [14] and remove samples with extreme expressions. The registered 3D models face the positive direction of the z-axis, and their centers coincide with the origin. Note that the unit of the registered 3D models is the millimeter.
2D data. The second stage is trained with 300W-LP [63] and VGGFace2 [64]. VGGFace2 contains 3.31 million images of 9131 subjects covering a large range of poses, ages, and ethnicities. 300W-LP is a synthetically generated dataset based on the 300-W database [65] containing 61,255 samples across various poses. In our preprocessing stage, the faces are aligned using similarity transformation and cropped to 224×224 in the RGB format with its landmark of 300W-LP.

Training Sphere Face Model with 3D data
SFMs are first trained with 3D data to learn the shape basis using the loss functions described below.

Loss function.
To aggregate the identity parameters of the same identity and separate those of different identities in cosine distance, we apply the modified softmax loss with normalized shape parameters and normalized weights, as introduced by NormFace [47]. In this loss, $n$ is the number of classes and $m$ is the number of samples in the batch; $y_i$ is the ground-truth label and $w_j$ represents the $j$th row of the basis $A$. At the same time, we aggregate the scaled identity parameters $s \cdot x/\|x\|$ of the same identity toward their center $c$ and separate the centers of different identities in Euclidean distance, where $c_i$ represents the center of the $i$th class. We also minimize the reconstruction error with basis regularization, where $M$ is the ground-truth mesh, $\hat{M}$ is the reconstructed mesh, $I$ is the identity matrix, and $w_a$ is the weight of the loss term. Finally, we optimize the objective function in Eq. (9) to solve for the target basis $A$.
Hyperparameter setting. We use the Adam optimizer, where the initial learning rate of $x$ and $s$ is 0.02 and that of $A$ is 0.005. The batch size is 512, and the learning rate is reduced to one-tenth every 20 epochs.
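The reconstruction term with basis regularization and the learning-rate schedule can be sketched as follows. The exact norms are not recoverable from the text, so the sketch assumes a squared L2 reconstruction error and a squared Frobenius penalty on the deviation of the Gram matrix of the basis from the identity; `recon_with_basis_reg` and `step_lr` are hypothetical helper names.

```python
import numpy as np

def recon_with_basis_reg(M_true, M_pred, A, w_a):
    """Reconstruction error plus an orthonormality penalty on the basis A.

    Assumed form: mean squared L2 error over the batch
    plus w_a * ||A^T A - I||_F^2.
    """
    recon = np.mean(np.sum((M_true - M_pred) ** 2, axis=-1))
    gram_residual = A.T @ A - np.eye(A.shape[1])
    return recon + w_a * np.sum(gram_residual ** 2)

def step_lr(base_lr, epoch, step=20, gamma=0.1):
    """Learning rate reduced to one-tenth every 20 epochs."""
    return base_lr * gamma ** (epoch // step)

rng = np.random.default_rng(0)
A, _ = np.linalg.qr(rng.standard_normal((12, 4)))  # stand-in orthonormal basis
M = rng.standard_normal((3, 12))                   # stand-in batch of flattened meshes

loss_orthonormal = recon_with_basis_reg(M, M, A, w_a=0.5)
loss_scaled = recon_with_basis_reg(M, M, 2.0 * A, w_a=0.5)
```

The penalty vanishes for an orthonormal basis and grows as the basis drifts away from orthonormality during optimization, which keeps the distance-preserving property of the model while the basis is being learned.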

Training Sphere Face Model with 2D data
In the second stage, we train a model to reconstruct the 2D face image. Here, the decoder is initialized from the first stage and is finetuned during this stage. In the following, $\varepsilon$ denotes the weight of a loss term.
Loss function. The loss function consists of three components: landmark loss, photometric loss, and recognition loss. The landmark loss and the recognition loss take effect according to the labels of the training data:
$$L = \begin{cases} L_{pix}(I_r, I) + \varepsilon_l L_{land} + \varepsilon_r L_{reg}, & I \in S_{recon} \\ L_{pix}(I_r, I) + \varepsilon_s L_{id} + \varepsilon_r L_{reg}, & I \in S_{id} \end{cases} \tag{10}$$
where $L_{pix}$ is the photometric loss, $L_{land}$ is the landmark loss, $L_{id}$ is the recognition loss, and $L_{reg}$ is the regularization term. $I_r$ is the rendered image and $I$ is the input image. The set $S_{recon}$ represents the training data with landmark annotations and $S_{id}$ is the training data with identity annotations. We explain these losses in detail below.
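The label-dependent switching of Eq. (10) can be sketched as a small function. The loss values are passed in as plain numbers, and the function name, the string tags `"recon"` and `"id"`, and the default weights are hypothetical.

```python
def stage2_loss(sample_set, l_pix, l_reg, l_land=0.0, l_id=0.0,
                eps_l=1.0, eps_s=1.0, eps_r=1.0):
    """Combine the stage-two loss terms according to the annotation type.

    sample_set is "recon" for samples with landmark annotations and "id"
    for samples with identity annotations (hypothetical tags).
    """
    if sample_set == "recon":
        return l_pix + eps_l * l_land + eps_r * l_reg
    if sample_set == "id":
        return l_pix + eps_s * l_id + eps_r * l_reg
    raise ValueError(f"unknown sample set: {sample_set!r}")

loss_recon = stage2_loss("recon", l_pix=1.0, l_reg=0.5, l_land=2.0, l_id=9.0,
                         eps_l=0.1, eps_s=0.2, eps_r=2.0)
loss_id = stage2_loss("id", l_pix=1.0, l_reg=0.5, l_land=2.0, l_id=9.0,
                      eps_l=0.1, eps_s=0.2, eps_r=2.0)
```

The photometric and regularization terms are always active, while the landmark and recognition terms are mutually exclusive, so a sample never contributes a loss for a label it does not carry.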
The landmark term $L_{land}$ uses the $L_1$ loss between the projected landmarks $\hat{V}_{2d}$ and the ground-truth landmarks $V_{2d}$, with $N$ denoting the number of landmarks. The face recognition loss includes three components, as shown in Eq. (12): a softmax-based loss, a center loss, and a Kullback-Leibler loss.
We use CosLoss [51] $L_{soft}$ as the softmax-based loss, which is applied to the identity parameters. The Kullback-Leibler loss [66] $L_{kl}$ and the center loss $L_{center}$ [45] are applied to the scale parameter. The photometric loss measures the difference between the rendered image and the input image as the absolute error between each corresponding pixel pair, weighted by a confidence map [67] that helps deal with occlusions and other challenging appearance variations such as beards and hair. In the weighted pixel-wise loss, $\ell_{1,uv} = |I_r^{uv} - I^{uv}|$ is the $L_1$ distance between the intensities of the input image $I$ and the reconstructed image $I_r$ at location $(u, v)$, $\sigma \in \mathbb{R}_+^{W \times H}$ is the confidence map, and $\Omega$ is the 2D image space.
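A minimal sketch of a confidence-weighted pixel loss is given below. It assumes the negative log-likelihood of a per-pixel Laplacian distribution, a form commonly paired with a learned confidence map; whether this is the exact expression used here is an assumption. The sketch uses single-channel images for brevity.

```python
import numpy as np

def confidence_weighted_l1(I_r, I, sigma):
    """Mean per-pixel Laplacian negative log-likelihood (assumed form).

    I_r, I: rendered and input images of shape (H, W).
    sigma:  positive confidence map of shape (H, W); a larger value means
            the pixel is trusted less.
    """
    l1 = np.abs(I_r - I)
    nll = np.log(np.sqrt(2.0) * sigma) + np.sqrt(2.0) * l1 / sigma
    return float(nll.mean())

I = np.zeros((2, 2))
I_r = np.array([[0.0, 0.0], [0.0, 2.0]])  # one badly reconstructed pixel

uniform = confidence_weighted_l1(I_r, I, np.ones((2, 2)))
sigma = np.ones((2, 2))
sigma[1, 1] = 2.0                          # lower confidence at the bad pixel
down_weighted = confidence_weighted_l1(I_r, I, sigma)
```

Raising `sigma` at a pixel with a large error reduces its contribution, while the logarithmic term prevents the trivial solution of declaring every pixel unreliable.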
As shown in Eq. (14), the regularization term L reg consists of two parts: parameter-level regularity loss L preg and mesh-level regularity loss L mreg .
In the regularization term $L_{preg}$ for the 3DMM coefficients, $\sigma_{exp}$ is an eigenvalue of the expression basis and $\sigma_{alb}$ is an eigenvalue of the albedo basis. $\alpha_{id}$, $\alpha_{exp}$, and $\alpha_{alb}$ are the 3DMM parameters regressed by the encoder network, as shown in Fig. 2; $m_{id}$, $m_{exp}$, and $m_{alb}$ are the dimensions of the shape, expression, and albedo parameters, respectively. The mesh-level regularity loss consists of the smoothness loss, the symmetry loss, and the residual loss.
In the smoothness loss, $G$ is the reconstructed face shape, $N_i$ denotes the set of vertices neighboring $G_i$, and $N$ is the number of vertices. We assume that human faces with natural expressions are symmetric about the center axis and add a symmetry loss on the face shape geometry, where flip() is the operation that flips the face shape geometry. In the residual loss, $\bar{G}$ is the mean face geometry.
More training details. Currently, there are no large public databases that contain both face identity labels and landmark labels. Moreover, since the results of existing face detectors are unsatisfactory in challenging conditions, we do not automatically generate landmarks for the face recognition dataset. Therefore, we use mixed data from 300W-LP [63] and VGGFace2 [64]. To train our model successfully on the mixed dataset, we use the following strategy to achieve convergence:
(1) Switch the loss function: Because the labels in the mixed database are incomplete, we determine which loss terms take effect according to the labels of the training samples. For example, if the training sample is from VGGFace2, we enable the face recognition loss and the photometric loss. Otherwise, the landmark loss and the photometric loss take effect, as shown in Eq. (10).
(2) Warm up the network: To warm up the network, we first train it on the 300W-LP [63] database only, using $S_{recon}$, and then train on the mixed database with the full loss function shown in Eq. (10).
(3) Balance the data from different datasets: Because VGGFace2 contains 3.31 million images while 300W-LP [63] contains only 61,255 samples, the two datasets are extremely unbalanced. We therefore design a sampling scheme in which the probability $P$ of selecting a sample from VGGFace2 is given by Eq. (20). Here, $N_{recon}$ is the number of samples in the 300W-LP dataset and $N_{recog}$ is that in the VGGFace2 dataset. The probability of selecting a sample from the 300W-LP database is $1 - P$.

Experiment
Compared with previous methods, SFMs have the following properties: (1) the shape parameter space of SFMs has an inherent separation property between classes; (2) the shape parameter space distribution of SFM is similar to that of identity embeddings, so that the losses for face recognition and face reconstruction can be easily optimized together in the pipeline of shape-consistent face reconstruction; (3) SFM has better capabilities for face representation. Therefore, in this section, we evaluate SFMs from the following three aspects: model representation ability, shape parameter space separability, and shape-consistent monocular reconstruction performance.

Model representation ability
To validate the expressive ability of face representation models, we reconstruct 3D meshes on the training and testing databases with the same latent vector dimension (199 in this paper). Evaluation on the training database shows the ability of the models to recover the meshes of the training data, while fitting meshes of the testing database verifies the generalizability of our model. We also present the results of parameter interpolation. Our training dataset is FRGCv2 [61], and the testing dataset is the Bosphorus database [68], which contains 4666 3D face models of 105 people. The models for each person have various expressions, poses, and occlusions. In our experiment, we select the face with a frontal natural expression for each person and register all the data to the BFM2017 [14] template. We first use rigid registration [69] to roughly align the template with the point cloud and then use non-rigid ICP [62]. When performing non-rigid registration, we first register with strong rigidity regularization and then use weaker regularization to register the meshes more finely.
We use the Adam optimizer to optimize the face parameters of the model. The initial learning rate is 0.02 and is halved every 128 iterations. The total number of optimization iterations is 1000.
The Root Mean Square Error (RMSE) between the reconstructed meshes and the ground truth is shown in Table 1 for the training dataset and in Table 2 for the testing dataset. All face models are trained on FRGCv2 but are built with different methods. "PCA" means that the face model is established directly by PCA. "Linear" means that the face model is established by optimizing an orthogonal linear basis. "Sphere-Linear" refers to using the structure of SFM without the face recognition loss when constructing the face shape. The expressive ability of our SFM basis is slightly better than that of the linear basis but worse than that of PCA, because among linear orthogonal bases the one solved by PCA has the smallest reconstruction error and is therefore optimal. Our reconstruction accuracy is slightly lower than that of PCA, but our model has better separation in the shape parameter space. Table 3 compares SFMs with the shape models of BFM2017 and FLAME on the Bosphorus database. We crop the face area for fitting because other areas (ears, neck) are irrelevant to our comparison. The results show that our face model has lower reconstruction error than the others.
Table 3 Comparison of the model representation ability and shape parameter separability with BFM17 [14] and FLAME [13]
Figure 3 shows some fitting results on the Bosphorus database. SFMs are competitive among all the validated 3DMMs in terms of expressive ability, with the best visual quality of the reconstruction results. Figure 4 shows that the parameters of our basis have excellent interpolation performance. We interpolate the identity parameters along the geodesic on the hypersphere and interpolate the scale parameters linearly.
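The interpolation scheme can be sketched with spherical linear interpolation (slerp), which follows the geodesic between two points on the unit hypersphere. That slerp is the exact procedure used for Fig. 4 is an assumption, and the function names are hypothetical.

```python
import numpy as np

def slerp(u, v, t):
    """Interpolate along the geodesic between the directions of u and v."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    omega = np.arccos(np.clip(u @ v, -1.0, 1.0))  # angle between the two directions
    if np.isclose(omega, 0.0):
        return u
    # Simplification: antipodal inputs (omega = pi) are not handled.
    return (np.sin((1.0 - t) * omega) * u + np.sin(t * omega) * v) / np.sin(omega)

def interpolate_face(x_a, s_a, x_b, s_b, t):
    """Geodesic interpolation of identity parameters, linear interpolation of scale."""
    identity = slerp(x_a, x_b, t)
    scale = (1.0 - t) * s_a + t * s_b
    return identity, scale

x_a, x_b = np.array([1.0, 0.0]), np.array([0.0, 2.0])
identity_mid, scale_mid = interpolate_face(x_a, 2.0, x_b, 4.0, 0.5)
```

Every interpolated identity vector stays on the unit hypersphere, so intermediate faces remain valid inputs to the model, whereas a plain linear blend of the two vectors would shrink toward the origin.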

Fig. 3 Fitting results of BFM17 [14], FLAME [13], and ours. The first row is the fitted mesh and the second row is the error map with respect to the ground truth. SFM* is the SFM finetuned with 2D data.
Fig. 4 On the left and right are two different scanned models. We first find their identity parameters and scale parameters. Then we interpolate the identity parameters on the hypersphere and interpolate the scale parameters linearly. Columns 2-5 are the results of the interpolation.

Separability of shape parameter space
After fitting all the 3D scans of a database, we get the parameters of the corresponding 3DMM model in this database. We can evaluate the clustering properties of these parameters to estimate the degree of separation of the shape parameter space. The performance of clustering is evaluated with the following metrics: the Silhouette Coefficient with Euclidean distance (SCE), the Silhouette Coefficient with cosine distance (SCC), and the Calinski-Harabasz score (CH). The Silhouette Coefficient is given as

$$SC = \frac{1}{n} \sum_{i=1}^{n} \frac{b_i - a_i}{\max(a_i, b_i)}$$

where $a_i$ is the mean distance between the $i$th sample and all other points of the same class, $b_i$ is the mean distance between the $i$th sample and all points of the nearest other cluster, and $n$ is the number of samples. The Calinski-Harabasz score (CH) is defined as the ratio of the mean between-cluster dispersion to the mean within-cluster dispersion over all clusters (where dispersion is defined as the sum of squared distances). For a dataset $E$ of $n_E$ samples grouped into $k$ clusters,

$$CH = \frac{B_k}{W_k} \times \frac{n_E - k}{k - 1}$$

where $B_k$ is the trace of the between-cluster dispersion matrix and $W_k$ is the trace of the within-cluster dispersion matrix, defined by

$$W_k = \sum_{q=1}^{k} \sum_{x \in C_q} \lVert x - c_q \rVert^2, \qquad B_k = \sum_{q=1}^{k} n_q \lVert c_q - c_E \rVert^2$$

with $C_q$ the set of points in cluster $q$, $c_q$ the center of cluster $q$, $c_E$ the center of $E$, and $n_q$ the number of points in cluster $q$. Table 1 and Table 2 show the shape parameter space separation of SFMs and of the shape bases constructed by other methods.
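To make the three separability metrics concrete, the following sketch computes them from their standard definitions for a small labelled point set. It is an illustration only: the function names and the toy data are assumptions, identity labels play the role of cluster labels, and the text does not say which implementation was used to produce the reported scores.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_distance(x, y):
    """1 - cosine similarity; undefined for zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return 1.0 - dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def silhouette(points, labels, dist):
    """Mean of (b_i - a_i) / max(a_i, b_i); needs at least two clusters."""
    n = len(points)
    clusters = sorted(set(labels))
    total = 0.0
    for i in range(n):
        mean_to = {}
        for c in clusters:
            ds = [dist(points[i], points[j]) for j in range(n) if labels[j] == c and j != i]
            if ds:
                mean_to[c] = sum(ds) / len(ds)
        if labels[i] not in mean_to:
            continue  # singleton cluster contributes 0 by convention
        a = mean_to[labels[i]]
        b = min(v for c, v in mean_to.items() if c != labels[i])
        total += (b - a) / max(a, b)
    return total / n


def calinski_harabasz(points, labels):
    """(B_k / W_k) * (n_E - k) / (k - 1) with sum-of-squares dispersions."""
    n, dim = len(points), len(points[0])
    clusters = sorted(set(labels))
    k = len(clusters)
    center_all = [sum(p[d] for p in points) / n for d in range(dim)]
    between = within = 0.0
    for c in clusters:
        members = [p for p, l in zip(points, labels) if l == c]
        center = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        between += len(members) * sum((a - b) ** 2 for a, b in zip(center, center_all))
        within += sum(sum((a - b) ** 2 for a, b in zip(p, center)) for p in members)
    return (between / within) * (n - k) / (k - 1)


# Toy shape parameters for two identities, two scans each (made-up values).
params = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
identities = [0, 0, 1, 1]
sce = silhouette(params, identities, euclidean)
ch = calinski_harabasz(params, identities)
```

Passing `cosine_distance` instead of `euclidean` to `silhouette` gives the SCC variant. Higher values of all three scores indicate that scans of the same person lie closer together and scans of different people lie further apart.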
We add the face recognition loss while establishing the SFM basis, which significantly improves the separation of the shape parameter space. Table 3 shows the comparison between our basis and other bases. The separability of our shape parameter space is also higher than that of other models. To present the distribution of the shape parameter space more intuitively, we use t-SNE [70] to project the shape parameters of different bases to two dimensions. As shown in Fig. 5, the intra-class distance of the shape parameter space of SFM is small, and the inter-class distance is large. Compared with other methods, the shape parameter space of our basis shows a much better separation.

Monocular reconstruction
To test monocular face reconstruction, we use the same encoder-decoder network as in the second training stage as the shape-consistent face reconstruction pipeline. Unlike in the training phase, for monocular reconstruction we fix the weights of the shape basis and retrain the encoder to regress the parameters. In this section, we evaluate the faithfulness and shape consistency of monocular face reconstruction results using SFM. In terms of faithfulness, we compare the visual results with those of other face reconstruction methods and also compare the accuracy of 3D face alignment. In terms of shape consistency, we compare the face recognition accuracy obtained using the shape parameters and the visual results for the same person reconstructed in different environments. In this subsection, when comparing with other methods, "ours" means that we use the SFM finetuned with 2D data.
Face shape consistency evaluation. We use the cosine distance and the Euclidean distance as the similarity measures between two sets of shape parameters when evaluating face recognition accuracy. The face recognition performance is shown in Table 4. The recognition accuracy of our shape parameters is higher than that of other methods. The results improve after SFM is finetuned with 2D data, because finetuning with 2D face data yields a model that generalizes more robustly. Moreover, thanks to the similar latent space distributions of SFM and face recognition, SFM has a higher degree of separation of parameter space than the traditional PCA model. Figure 10 shows the 3D faces reconstructed for the same person in different environments. Our method has smaller errors among the meshes reconstructed for the same person.
Face faithfulness evaluation. As shown in Fig. 6, finetuning with 2D data can improve the expression ability of the model and generate faces with more details. Compared with PCA, SFM has more features of face identity. Figure 7 shows that our reconstruction results capture more face details than other face reconstruction methods. Figure 11 shows the cumulative error distribution curves of 3D face alignment compared with other methods. Figure 8 shows the visual results of face alignment and Fig. 9 shows the visual results of face shape. Both quantitative and visual evaluations show that, in terms of face faithfulness, our method performs better than previous methods.
Table 4 Face verification accuracy (%) on the LFW, CFP-FP, and YTF datasets. Our results are obtained using the weighted center loss. We compare our results with 3DMM-CNN [10], Liu et al. [11], D3FR [34], TDDFA [33], MGCNet [72], Jiang et al. [42], RingNet [9], and DECA [41]. PCA is the PCA-based model using the same 3D face dataset as SFM. SFM* is the SFM finetuned with 2D data.
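Face verification with shape parameters, as reported in Table 4, amounts to thresholding a similarity score between the parameters of two images. The sketch below illustrates the idea with cosine similarity; the function names, the toy pairs, and the fixed threshold are all assumptions, and the actual threshold-selection protocol of each benchmark is not described in the text.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two non-zero shape parameter vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q)))


def verify(p, q, threshold):
    """Predict 'same person' when the similarity reaches the threshold."""
    return cosine_similarity(p, q) >= threshold


def verification_accuracy(pairs, same_person, threshold):
    """Fraction of pairs whose prediction matches the ground-truth label."""
    correct = sum(
        verify(p, q, threshold) == is_same
        for (p, q), is_same in zip(pairs, same_person)
    )
    return correct / len(pairs)


# Made-up 2D shape parameters: two genuine pairs and two impostor pairs.
pairs = [
    ([1.0, 0.0], [0.9, 0.1]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.0, 1.0], [0.1, 0.9]),
    ([1.0, 0.0], [-1.0, 0.2]),
]
same_person = [True, False, True, False]
accuracy = verification_accuracy(pairs, same_person, threshold=0.5)
```

A Euclidean variant would replace the similarity with a distance and flip the comparison, predicting "same person" when the distance falls below the threshold.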
User study. We conducted a user study to compare the visual diversity of the reconstructed face shapes and how well they retain identity information. We randomly selected 20 face images from AFLW2000 and reconstructed 3D face models using the following methods: RingNet [9], D3FR [34], MGCNet [72], and our SFM. We then asked 5 participants to score, from 0 to 10, the diversity of the reconstructed faces and how well each reconstruction retains the identity information of the input image. Participants were told that reconstruction results that maintain more identity information, or that show more diversity across different people, should be scored higher.
The average scores of the results from different methods are shown in Fig. 12. "Identity" means the degree of identity preservation, and "diversity" means the diversity among the 3D faces reconstructed from different people. The results and comparisons show the advantages of our method.
Fig. 6 Ablation experiment samples from the MICC [71] dataset. PCA is the PCA-based model using the same 3D face dataset as SFM. SFM* is the SFM finetuned with 2D data.

Limitation
The defects in our results, where the eye region is of low quality, are mainly caused by the relatively noisy eyes in the 3D training dataset (FRGCv2 [61]). Another disadvantage of our method is that it requires multiple scans of neutral faces of many different people. Note that even the latest high-quality public 3D face datasets have a limited number of individuals, and each individual has only one neutral scan. In the user study, the identity scores of the three methods are comparable, which is also due to the limitation of current 3D face datasets. Nevertheless, our main goal is to reconstruct a stable face under different conditions with SFM, as shown in Fig. 10, where our method outperforms previous methods.

Fig. 8 Visualization results on the AFLW2000 dataset. The first row: images from the AFLW2000 dataset. The second row: the results of 3DDFA v2 [33]. The third row: the results of RingNet [9]. The fourth row: the results of DECA [41]. The last row: our results.

Fig. 9 Some samples of the user study. The first row: samples from the AFLW2000 dataset. The second row: our results. The third row: the results of RingNet [9]. The fourth row: the results of D3FR [34]. The fifth row: the results of MGCNet [72]. The last row: the results of DECA [41].

Fig. 10 Comparison with D3FR [34] and MGCNet [72] on LFW samples. The reconstruction results are of the same person under different conditions. The third and sixth rows are the errors between the two meshes in the same column.

Fig. 11 Cumulative error distribution curves of 3D face alignment accuracy on the AFLW2000 dataset. Compared with PRNet [73] and 3DDFA [7], our method produces better results.
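A cumulative error distribution curve such as the one in Fig. 11 reports, for each error threshold, the fraction of test samples whose alignment error is at or below that threshold. The sketch below shows one way to compute it; the function names and toy numbers are assumptions, and the normalization factor (for example the face bounding-box size) is left as a parameter because the text does not specify it.

```python
import math


def normalized_mean_error(predicted, ground_truth, norm_factor):
    """Mean landmark distance divided by a normalization length.

    `norm_factor` is an assumed parameter; its definition depends on
    the benchmark protocol.
    """
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances) / norm_factor


def ced_curve(errors, thresholds):
    """Fraction of samples with error <= each threshold."""
    n = len(errors)
    return [sum(e <= t for e in errors) / n for t in thresholds]


# Made-up per-image errors and thresholds, for illustration only.
errors = [0.02, 0.04, 0.06, 0.10]
thresholds = [0.03, 0.05, 0.10]
curve = ced_curve(errors, thresholds)
```

A method whose curve lies above another at every threshold has a larger share of samples under each error level, which is how such plots are usually read.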

Conclusions
We have proposed a novel 3D morphable model with a hypersphere manifold shape parameter space for face generation. We have also proposed a two-stage training framework in which both 3D and 2D data are utilized. Our model outperforms previous models in the consistency and the fidelity of the reconstructed faces. Experimental results validate that our method is objectively superior to previous methods, and the user study shows that our model can provide visually better face reconstruction results.