Shape My Face: Registering 3D Face Scans by Surface-to-Surface Translation

Existing surface registration methods focus on fitting in-sample data with little to no generalization ability and require both heavy pre-processing and careful hand-tuning. In this paper, we cast the registration task as a surface-to-surface translation problem, and design a model to reliably capture the latent geometric information directly from raw 3D face scans. We introduce Shape-My-Face (SMF), a powerful encoder-decoder architecture based on an improved point cloud encoder, a novel visual attention mechanism, graph convolutional decoders with skip connections, and a specialized mouth model that we smoothly integrate with the mesh convolutions. Compared to the previous state-of-the-art learning algorithms for non-rigid registration of face scans, SMF only requires the raw data to be rigidly aligned (with scaling) with a pre-defined face template. Additionally, our model provides topologically-sound meshes with minimal supervision, offers faster training time, has orders of magnitude fewer trainable parameters, is more robust to noise, and can generalize to previously unseen datasets. We extensively evaluate the quality of our registrations on diverse data. We demonstrate the robustness and generalizability of our model with in-the-wild face scans across different modalities, sensor types, and resolutions. Finally, we show that, by learning to register scans, SMF produces a hybrid linear and non-linear morphable model that can be used for generation, shape morphing, and expression transfer through manipulation of the latent space, including in-the-wild. We train SMF on a dataset of human faces comprising 9 large-scale databases on commodity hardware.

Figure 1: Sample test scans and their registration. Left to right: textured mesh, input point cloud sampled uniformly from the mesh (black) and the attention mask predicted by the model (green), registration, and heatmap of the surface error.
... and meshes. Human face scans, in particular, are often given as either range images or meshes, but typically do not share a common parameterization (i.e., the output of the 3D scanner does not typically have a fixed connectivity, sampling rate, etc.). Fundamentally, this diversity of representations is only a by-product of the inability of computers to represent continuous surfaces, but the latent geometric information to be represented is the same. In practice, this poses a challenge: two surfaces represented with two different parameterizations are not easily compared, which makes exploiting the geometric information difficult. Finding a shared representation while preserving the geometry is the task of dense surface registration, a cornerstone in both 3D computer vision and graphics (Amberg et al., 2007; Salazar et al., 2014).
The design and construction of a shared shape representation is often implemented by means of a common template, which has a predefined number of vertices and vertex connectivity. After choosing the common template, a fitting method is implemented to bring the raw facial scans in dense correspondence with the chosen template. The use of a common template is a crucial step towards learning a statistical model of the face shape, also known as a 3D Morphable Model (3DMM) (Blanz and Vetter, 1999; Booth et al., 2016), which is a very important tool for shape representation and has been used for a wide range of applications spanning from 3D face reconstruction from images (Blanz and Vetter, 2003; Booth et al., 2018b) to diagnosis and treatment of face disorders (Knoops et al., 2019; Mueller et al., 2011).
Arguably, the current methods of choice for establishing dense correspondences are variants of Non-rigid Iterative Closest Point (NICP) (Amberg et al., 2007), and non-rigid registration approaches whose regularization properties are defined by statistical (Cheng et al., 2017) and non-statistical (Lüthi et al., 2018) models. The application of deep learning techniques to the problem of establishing dense correspondences only recently became possible, after the design of proper layered structures that directly consume point clouds and respect the permutation invariance of points in the input data (e.g., PointNet (Qi et al., 2017a)).
To the best of our knowledge, the only technique that tries to solve the problem of establishing dense correspondences on unstructured point-cloud data and learning a face model on a common template has been presented in Liu et al. (2019). The method uses a PointNet to summarise (i.e., encode) the information of an unstructured facial point cloud. Then, fully-connected layers (similar to the ones used in dense statistical models (Blanz and Vetter, 1999; Booth et al., 2016)) are used to reconstruct (i.e., decode) the geometric information in the topology of the common template. In this paper, we work on a similar line of research and make a series of important contributions in three different areas:
- Network architecture. We propose architectural modifications of the point cloud CNN framework that lift restrictions of Qi et al. (2017a). In particular, to avoid having to adopt heuristic noise reduction and cropping strategies, we incorporate a learned attention mechanism in the network structure. We demonstrate that the proposed architecture is better suited for data captured in the wild. Furthermore, we propose a variant of PointNet better suited for small batches, hence able to consume higher-resolution raw scans. The morphable model part of our network (i.e., the decoder) comprises a series of mesh-convolutional layers (Bouritsas et al., 2019; Gong et al., 2019) with skip connections, novel in the mesh processing literature, that better capture details and local structures. Finally, our network is also considerably smaller than the state of the art.
- Engineering/Implementation. One of the major challenges when establishing dense correspondences in raw facial scans is the large deformations of the mouth area, especially in extreme expressions. We propose a carefully engineered approach that smoothly incorporates a statistical mouth model. We demonstrate that our method captures the mouth area very robustly.
- Application. Our emphasis in this work is on robustness to noise in the scans (e.g. sensor noise, background contamination, and points from the inside of the mouth), compactness of the model, and generalization. The model we develop should be readily usable on, e.g., embedded 3D scanners to produce both a registered scan and a set of latent representations that can be leveraged in downstream tasks. We present extensive experiments to demonstrate the power of our algorithm, such as expression transfer and interpolation between in-the-wild scans across modalities and resolutions. One of the major outcomes of our paper is a novel morphable model trained on 9 diverse large-scale datasets, which will be made public. Figure 1 shows some test textured scans and their corresponding registrations and attention masks.

Structure of the Paper
We provide an extensive summary of prior published work in Section 2, covering relevant areas of the morphable models, registration, and 3D deep learning literature. Section 3 is dedicated to reviewing the current state-of-the-art model, which we use as a baseline in our experiments, and to highlighting the limitations and challenges we tackle. We introduce our model, Shape My Face (SMF), in Section 4, and provide detailed descriptions of its different components, how they provide solutions to the challenges identified in Section 3, and how they allow us to frame the registration task as a surface-to-surface translation problem. We also introduce our model trained on a very large dataset comprising 9 large databases of human face scans. For the sake of clarity, we split our experimental evaluation into two parts. Section 5 studies the performance of SMF for registration, and presents a statistical analysis of the model's stability, as well as an ablation study. Section 6 evaluates SMF on morphable model applications and studies properties of the latent representations; in particular, in Section 6.4 we evaluate SMF on surface-to-surface translation applications entirely in the wild.
Notations. Throughout the paper, matrices and vectors are denoted by upper and lowercase bold letters (e.g., $\mathbf{X}$ and $\mathbf{x}$), respectively. $\mathbf{I}$ denotes the identity matrix of compatible dimensions. The $i$-th column of $\mathbf{X}$ is denoted as $\mathbf{x}_i$. The set of real numbers is denoted by $\mathbb{R}$. A graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ consists of vertices $\mathcal{V} = \{1, \ldots, n\}$ and edges $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$. The graph structure can be encoded in the adjacency matrix $\mathbf{A}$, where $a_{ij} = 1$ if $(i, j) \in \mathcal{E}$ (in which case $i$ and $j$ are said to be adjacent) and zero otherwise. The degree matrix $\mathbf{D}$ is a diagonal matrix with elements $d_{ii} = \sum_{j=1}^{n} a_{ij}$. The neighborhood of vertex $i$, denoted by $\mathcal{N}(i) = \{ j : (i, j) \in \mathcal{E} \}$, is the set of vertices adjacent to $i$.
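As a small illustration of the graph notation, the following sketch builds the adjacency matrix, the degree matrix, and a neighborhood for a toy graph. The graph itself and the helper name `neighborhood` are made up for the example, and vertices are 0-indexed in the code whereas the text numbers them from 1.

```python
import numpy as np

# Toy undirected graph on n = 4 vertices (hypothetical example).
n = 4
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]

# Adjacency matrix A: a_ij = 1 if (i, j) is an edge, 0 otherwise.
A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i, j] = 1
    A[j, i] = 1  # undirected graph: store both orientations

# Degree matrix D: diagonal, with d_ii = sum_j a_ij.
D = np.diag(A.sum(axis=1))


def neighborhood(A, i):
    """Set of vertices adjacent to vertex i."""
    return set(np.flatnonzero(A[i]).tolist())
```

Here vertex 2 is adjacent to every other vertex, so its degree is 3 and its neighborhood is {0, 1, 3}.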

Related Work
Although primarily a fast registration method with a focus on generalizability to unseen data, our approach also makes important progress towards learning an accurate part-based non-linear 3D morphable model of the human face, as well as a generative model with applications to surface-to-surface translation. We first review the relevant literature across the related fields. Then, we devote Section 3 to exposing the limitations of the current state of the art algorithm that motivate the choices made in this work.

Surface Registration and Statistical Morphable Models
Surface registration is the task of finding a common parameterization for heterogeneous surfaces. It is a necessary preprocessing step for a range of downstream tasks that assume a consistent representation of the data, such as statistical analysis and building 3D morphable models. As such, it is a fundamental problem in 3D computer vision and graphics.

Surface registration
Two main classes of methods coexist for surface registration. Image-based registration methods first require finding a mapping between the surface to align and a two-dimensional parameter space; most commonly, a UV parameterization is computed for a textured mesh, typically using a cylindrical projection. Image registration methods are then applied to align the unwrapped surface with a template, for instance using optical flow analysis (Horn and Schunck, 1981; Lefébure and Cohen, 2001), or thin plate spline warps (Bookstein, 1989). UV-space registration is computationally efficient and relies on mature image processing techniques, but the flattening step unavoidably leads to a loss of information, and sampling of the UV space is required to reconstruct a surface. For this reason, the second main class of surface registration methods operates directly in 3D, avoiding the UV space entirely. Prominent examples include the Non-rigid Iterative Closest Point (NICP) method (Amberg et al., 2007), a generalization of the Iterative Closest Point (ICP) method (Chen and Medioni, 1991; Besl and McKay, 1992) that introduces local deformations, and the Coherent Point Drift (CPD) algorithm (Myronenko et al., 2007; Myronenko and Song, 2010). NICP operates on meshes and solves a non-convex energy minimization problem that encourages the vertices of the registered mesh to be close to the target surface, and the local transformations to be similar for spatially close points. Due to its non-convex nature, NICP is sensitive to initialization, and is most often used in conjunction with sparse annotations (i.e. landmarks for which a 1-to-1 correspondence is known a priori). Similarly, CPD also encourages the motion of neighboring points to be similar, but operates on point clouds and frames the registration problem as that of mass matching between probability distributions. As such, it is closely related to optimal transport registration (Feydy et al., 2017).
We refer to relevant surveys (van Kaick et al., 2011;Tam et al., 2013) for a more complete review of non-deep learning based surface registration methods.

Linear and multilinear morphable models
Linear morphable models for the human face were first introduced in the seminal work of Blanz and Vetter (1999).
The authors proposed to model the variability of human facial anatomy by applying Principal Component Analysis (PCA) (Pearson, 1901; Hotelling, 1933) to 200 laser scans (100 male and 100 female) of young adults in a neutral pose. Scans were aligned by image registration in the UV space with a regularized form of optical flow. The resulting set of components forms an orthogonal basis of faces that can be manipulated to synthesize new faces. Amberg et al. (2008) extended the PCA approach to handle expressions for expression-invariant 3D face recognition, using scans registered directly with NICP (Amberg et al., 2007). Patel and Smith (2009) introduced the widely-used Basel Face Model (BFM), also trained on 200 scans registered with NICP. It is only with the work of Booth et al. (2016, 2018a) that a morphable model trained on a large heterogeneous population, known as the Large Scale Face Model (LSFM), was made available. The authors use the BFM template and a modification of the NICP algorithm, along with automated pruning strategies, to build a high quality model of the human face from almost 10000 subjects. LSFM is trained on neutral scans only, but can be combined with a bank of facial expressions, such as the popular FaceWarehouse (Cao et al., 2014).
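The PCA construction of a linear morphable model can be sketched as follows. This is a minimal illustration on synthetic data standing in for registered scans, not the pipeline of any of the cited works; the array sizes, the number of components `k`, and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for registered scans: 200 shapes with N = 50 vertices each,
# flattened to vectors of 3N coordinates (hypothetical data, not real faces).
num_scans, N, k = 200, 50, 5
true_mean = rng.normal(size=3 * N)
true_basis = rng.normal(size=(k, 3 * N))       # k hidden modes of variation
shapes = true_mean + rng.normal(size=(num_scans, k)) @ true_basis

# PCA via the SVD of the centered data matrix.
mean = shapes.mean(axis=0)
_, singular_values, Vt = np.linalg.svd(shapes - mean, full_matrices=False)
components = Vt[:k]                             # orthonormal basis of shape variation
stddev = singular_values[:k] / np.sqrt(num_scans - 1)

# Synthesize a new shape by choosing coefficients in units of standard deviation.
alpha = rng.normal(size=k)
new_shape = mean + (alpha * stddev) @ components

# Project the training shapes onto the model and reconstruct them.
reconstructed = mean + (shapes - mean) @ components.T @ components
```

Because the synthetic data has exactly `k` modes of variation, the `k` retained components reconstruct the training shapes up to numerical precision; real scans would leave a residual.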
Multilinear extensions of linear morphable models have been considered as early as Vlasic et al. (2005), where a tensor factorization was used to model different modes of variation independently (e.g., identity and expression) with applications to face transfer, and were refined by Bolkart and Wuhrer (2015). However, the multilinear approach requires every combination of subject and expression to be present exactly once in the dataset, a requirement that can be both hard to satisfy and limiting in practice. Salazar et al. (2014) proposed an explicit decomposition into blendshapes as an alternative.

Part-based models
Besides a global PCA model, Blanz and Vetter (1999) also presented a part-based morphable model. The authors manually segmented the face into separate regions and trained specialized 3DMMs for each part, which can then be morphed independently. The resulting model is more expressive than a global PCA would be, and is obtained by combining the parts using a modification of the image blending algorithm of Burt and Adelson (1985). De Smet and Van Gool (2011) and Tena et al. (2011) showed that manual segmentation may not be optimal, and that a better segmentation can be defined by statistical analysis. Tena et al. (2011) designed an interpretable region-based model for facial animation purposes. In Li et al. (2017), the authors propose to combine an articulated jaw with linear blending to obtain a non-linear model of facial expressions.
Part-based models also appear when attempting to jointly represent distinct parts of the body. Romero et al. (2017) model hands and bodies together by replacing the hand region of SMPL (Loper et al., 2015) with a new specialized hand model called MANO. Joo et al. (2018) present the Frankenstein model, a morphable model of the whole human body that combines existing specialized models of the face (Cao et al., 2014), body (Loper et al., 2015), and a new artist-generated model for hands. The model's parameters are defined as the concatenation of all the parts' parameters. The final reconstruction is obtained by linear blending of the vertices of the separate parts using a manually-crafted matrix. The final model has fewer vertices than the sum of its parts, and the parts were manually aligned. As per the authors' own description, minimal blending is done at the seams.
In Ploumpis et al. (2019, 2020), a high-definition head and face model is created by blending together the Liverpool-York Head model (LYHM) (Dai et al., 2017) and the Large-Scale Face Model (LSFM) (Booth et al., 2018a). While LYHM includes a facial region, replacing it with LSFM offers more details. Two approaches are proposed to combine the models smoothly: a regression model learned between the two models' parameter spaces, and a Gaussian Process Morphable Model (GPMM) approach (Lüthi et al., 2018) where the covariance matrix of a GPMM is carefully crafted from the covariance matrices of its parts using a weighting scheme based on the Euclidean distance of the vertices to the nose tip of the registered meshes (i.e. the outputs of the head and face models). A refinement phase involving non-rigid ICP further tunes the covariance matrix of the GPMM.
We refer the interested reader to the recent review of Egger et al. (2020) for more information.

Deep Learning on Surfaces
Deep neural networks now permeate computer vision, but have only become prominent in 3D vision and graphics in the past few years. We review some of the recent algorithmic advances for representation learning on surfaces, surface registration, and morphable models.

Geometric deep learning on point clouds and meshes
Recent methods from the field of Geometric Deep Learning have emerged that propose analogues of classical deep learning operations, such as convolutions, for meshes and point clouds.
Point cloud processing methods treat the discrete surface as an unordered point set, with no pre-defined notion of intrinsic distances or connectivity. The pioneering work of PointNet (Qi et al., 2017a) defines a point set processing layer as a 1×1 convolution shared among all points, followed by batch normalization and ReLU activation. The resulting local point-wise features are aggregated into a global representation of the surface by max pooling. In spite of its simplicity, PointNet achieved state-of-the-art results in both 3D object classification and point cloud segmentation tasks, and remains competitive to this day. Follow-up works have explored extending PointNet to enable hierarchical feature learning (Qi et al., 2017b), as well as more powerful architectures that attempt to learn the metric of the surface via local kernel functions (Xu et al., 2018; Lei et al., 2019; Zhang et al., 2019), or by building a k-NN graph in the feature space. While these methods obtain higher classification and segmentation accuracy, their computational complexity limits their application to large-scale point clouds, a task for which PointNet is often preferred.
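The PointNet layer described above (a point-wise map shared among all points, followed by a symmetric aggregation) can be sketched in a few lines. This is a simplified illustration rather than the reference implementation: batch normalization is omitted, the layer widths are arbitrary, and the function name `pointnet_features` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def pointnet_features(points, weights, biases):
    """Shared point-wise MLP followed by max pooling over the points.

    points: (num_points, 3). Every layer applies the same affine map to each
    point (equivalent to a 1x1 convolution) and then a ReLU. Batch
    normalization is left out of this sketch.
    """
    h = points
    for W, b in zip(weights, biases):
        h = np.maximum(h @ W + b, 0.0)
    return h.max(axis=0)  # symmetric aggregation into one global feature vector


# Arbitrary layer widths chosen for the example.
dims = [3, 16, 32]
weights = [rng.normal(size=(a, b)) for a, b in zip(dims[:-1], dims[1:])]
biases = [rng.normal(size=b) for b in dims[1:]]

cloud = rng.normal(size=(100, 3))
shuffled = cloud[rng.permutation(len(cloud))]

feat = pointnet_features(cloud, weights, biases)
feat_shuffled = pointnet_features(shuffled, weights, biases)
```

Since the per-point map is shared and the maximum is a symmetric function, `feat` and `feat_shuffled` agree: the global feature does not depend on the order in which the points are listed.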
Graph Neural Networks, on the other hand, assume the input to be a graph, which naturally defines connectivity and distances between points. Initial formulations were based on the convolution theorem and defined graph convolutions using the graph Fourier transform, obtained by eigenanalysis of the combinatorial graph Laplacian (Bruna et al., 2014), and relied on smoothness in the spectral domain to enforce spatial locality. Defferrard et al. (2016) accelerated spectral graph CNNs by expanding the filters on the orthogonal basis of Chebyshev polynomials of the graph Laplacian, also providing naturally localized filters. However, the Laplacian is topology-specific, which hurts the performance of these methods when a fixed topology cannot be guaranteed. Kipf and Welling (2017) further simplified graph convolutions by reducing ChebNet to its first order expansion, merging trainable parameters, and removing the reliance on the eigenvalues of the Laplacian. The resulting model, GCN, has been shown to be equivalent to Laplacian smoothing and has not been successful in shape processing applications. Attention-based models (Monti et al., 2017; Fey et al., 2018; Verma et al., 2018; Veličković et al., 2018) dynamically compute weighted features of a vertex's neighbours and do not expect a uniform connectivity in the dataset; they generalize the early spatial mesh CNNs that operated on pre-computed geodesic patches (Masci et al., 2015; Boscaini et al., 2016). Spatial and spectral approaches have both been shown to derive from the more general neural message passing framework (Gilmer et al., 2017). Recently, SpiralNet (Lim et al., 2018), a specialized operator for meshes, has been introduced based on a consistent sequential enumeration of the neighbors around a vertex. Gong et al. (2019) introduce a refinement of the SpiralNet operator, coined SpiralNet++, which simplifies the computation of the spiral patches.
Mathematically, SpiralNet++ reads
$$\mathbf{x}_i^{(k)} = \gamma^{(k)}\left( \Big\Vert_{j \in S(i, M)} \mathbf{x}_j^{(k-1)} \right),$$
with $\gamma^{(k)}$ an MLP, $\Vert$ the concatenation, and $S(i, M)$ the spiral sequence of neighbors of $i$ of length (i.e. kernel size) $M$.
Recent work has also explored skip connections to help train deep graph neural networks. In Appendix B of Kipf and Welling (2017), the authors propose a residual architecture for deep GCNs. Hamilton et al. (2017) introduce an architecture for inductive learning on graphs based on an aggregation step followed by concatenation of the previous feature map and transformation by a fully-connected layer. Very deep variants of the Dynamic Graph CNN using residual and dense connections have also been studied for point cloud processing. Finally, in Gong et al. (2020), the authors relate graph convolution operators to radial basis functions to propose affine skip connections, and demonstrate improved performance compared to vanilla residuals for a range of operators.
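To make the spiral operator concrete, the sketch below applies one such layer: the features of the vertices along each spiral sequence are concatenated and passed through a learned map. It assumes the spiral sequences have already been computed and stored as an index array; the sequences used here are made up, and a single linear layer stands in for the MLP $\gamma$. The name `spiral_conv` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def spiral_conv(x, spirals, W, b):
    """One spiral convolution layer.

    x:       (n, c_in) features, one row per vertex.
    spirals: (n, M) integer array; row i lists the spiral sequence S(i, M).
    W, b:    parameters of a single linear layer standing in for the MLP gamma.
    """
    n, _ = spirals.shape
    gathered = x[spirals]              # (n, M, c_in): features along each spiral
    stacked = gathered.reshape(n, -1)  # concatenation over the M neighbours
    return stacked @ W + b


n, c_in, c_out, M = 6, 4, 8, 3
x = rng.normal(size=(n, c_in))
# Hypothetical spiral sequences: each one starts at the vertex itself.
spirals = np.array([[i, (i + 1) % n, (i + 2) % n] for i in range(n)])
W = rng.normal(size=(M * c_in, c_out))
b = np.zeros(c_out)

out = spiral_conv(x, spirals, W, b)

# The same computation written out explicitly for vertex 0.
manual = np.concatenate([x[j] for j in spirals[0]]) @ W + b
```

Because the enumeration of neighbours is fixed and ordered, each position in the spiral gets its own slice of `W`, which is what distinguishes this operator from isotropic aggregations such as a mean over neighbours.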

Registration
The methods presented in Section 2.1.1 are framed as optimization problems that need to be solved for every surface individually. Although able to produce highly accurate registrations, they can be costly to apply to large datasets, and are based on axiomatic conceptualizations of the registration task. The reliance on sparse annotations to accurately register expressive scans also means the data needs to be manually annotated, a tedious and expensive task. A new class of learning-based surface registration models is therefore emerging that, once past the initial training effort, promises to reduce the registration of new data to a fast inference pass, and to potentially outperform hand-crafted algorithms. In PointNetLK (Aoki et al., 2019), the authors adapt the image registration of Lucas and Kanade (1981) to point clouds in a supervised learning setting. A PointNet (Qi et al., 2017a) encoder is trained to predict a rigid body transformation $\mathbf{G} \in SE(3)$, with a loss defined between the network's prediction $\mathbf{G}_{est}$ and a ground truth transformation $\mathbf{G}_{gt}$ as $\|\mathbf{G}_{est}^{-1} \mathbf{G}_{gt} - \mathbf{I}\|_F$, with $\|\cdot\|_F$ the Frobenius (matrix $\ell_2$) norm. A similar technique is employed in Wang and Solomon (2019a), where the authors introduce a supervised learning model for rigid registration coined Deep Closest Point (DCP). DCP learns to predict the parameters of a rigid motion to align two point clouds, and is trained on synthetically generated pairs of point clouds, for which the ground truth parameters are known. The follow-up work of PRNet (Wang and Solomon, 2019b) offers a self-supervised approach for learning rigid registration between partial point clouds. In Lu et al. (2019) and Li and Zhang (2019), supervised learning algorithms are defined for rigid registration, but with losses defined on dense correspondences between points, and on a soft-assignment matrix, respectively. Finally, Shimada et al. (2019) designed a U-Net-like architecture on voxel grids for non-rigid point set registration. However, their method is limited by the resolution of the grid and does not build latent representations of the scans, nor does it provide a morphable model.
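The PointNetLK loss between an estimated and a ground truth rigid transformation can be evaluated directly on 4×4 homogeneous matrices. The sketch below is only an illustration of the formula; the helper names (`rigid_transform`, `rot_z`, `pointnetlk_loss`) and the example transformations are assumptions.

```python
import numpy as np


def rigid_transform(rotation, translation):
    """4x4 homogeneous matrix of an element of SE(3)."""
    G = np.eye(4)
    G[:3, :3] = rotation
    G[:3, 3] = translation
    return G


def rot_z(theta):
    """Rotation by angle theta (radians) about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def pointnetlk_loss(G_est, G_gt):
    """Frobenius norm of G_est^{-1} G_gt - I."""
    return np.linalg.norm(np.linalg.inv(G_est) @ G_gt - np.eye(4), ord="fro")


G_gt = rigid_transform(rot_z(0.3), np.array([0.1, -0.2, 0.05]))
G_close = rigid_transform(rot_z(0.31), np.array([0.1, -0.2, 0.05]))
G_far = rigid_transform(rot_z(1.0), np.array([0.5, 0.0, 0.0]))

loss_exact = pointnetlk_loss(G_gt, G_gt)
loss_close = pointnetlk_loss(G_close, G_gt)
loss_far = pointnetlk_loss(G_far, G_gt)
```

The loss vanishes when the prediction equals the ground truth, since the product of the inverse estimate and the ground truth is then the identity, and it grows as the estimate drifts away in rotation or translation.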

Morphable models
Abrevaya et al. (2018) train a hybrid encoder-decoder architecture on rendered height maps from 3D face scans using an image CNN encoder and a multilinear decoder. This approach circumvents the need for prior registration of the scans to a template, but the face model itself remains linear.
Concurrently, there has been a surge of interest in deep non-linear morphable models to better capture extreme variations. Bagautdinov et al. (2018) model facial geometry in UV space with a variational auto-encoder (VAE). Tran and Liu (2018) replace the linear bases with fully-connected decoders to model 3D geometry and texture from images, a technique extended in Tran et al. (2019). Ranjan et al. (2018) introduce a convolutional mesh auto-encoder based on Chebyshev graph convolutions (Defferrard et al., 2016). Bouritsas et al. (2019) use Spiral Convolutions (Lim et al., 2018) to learn non-linear morphable models of bodies and faces. In both these works, the connectivity of the 3D meshes is assumed to be fixed; that is, the scans have to be registered a priori. The non-linear deep neural network replaces the PCA for dimensionality reduction. In Liu et al. (2019), an asymmetric autoencoder is proposed. A PointNet encoder is applied to rigidly aligned heterogeneous raw scans, and two fully-connected decoders produce identity and expression blendshapes independently on the BFM face template. Thus, the algorithm produces a registration of the input scan. Mesh convolutional decoders are proposed in Kolotouros et al. (2019b) for human body reconstruction from single images. In Kolotouros et al. (2019a), model-fitting is introduced to also produce representations directly on the SMPL model.

State of the Art
The autoencoder architecture of Liu et al. (2019) is the current state of the art for the learned registration of 3D face scans. A learning-based approach for registration is desirable since a model that generalizes would be able to register new scans very quickly, thus potentially offsetting the time spent training the model. Other benefits compared to traditional optimization-based registration may include increased robustness to noise in the data. Furthermore, an autoencoder learns an efficient latent representation of the scans, which may later be processed for other applications, while the trained decoder can be used in isolation as a morphable model.
Motivated by the aforementioned potential upsides, we review the approach of Liu et al. (2019) and identify key limitations and areas of improvement. We further evaluate a pre-trained model provided by the authors of Liu et al. (2019) on the same dataset used in the original paper (also provided by the authors). We refer to the provided pre-trained model as the baseline.

Problem formulation and architecture
A crop of the mean face of the BFM 2009 model is chosen as a face template on which to register the raw 3D face scans. A registered (densely aligned) face is modeled as an identity shape with an additive expression deformation:
$$\mathbf{S} = \mathbf{S}_{id} + \Delta\mathbf{S}_{exp} = [x_1, y_1, z_1, \ldots, x_N, y_N, z_N]^\top \in \mathbb{R}^{3N},$$
the concatenated, consistently ordered, Cartesian 3D coordinates of the vertices. For this template, $N = 29495$. A subset of $N_s$ vertices from a processed input scan (details of the processing below) are sampled at random to obtain a point cloud representation of the scan. A vanilla PointNet encoder without spatial transformers produces a joint embedding $\mathbf{z}_{joint} \in \mathbb{R}^{1024}$. Two fully-connected (FC) layers, without non-linearities, are applied in parallel to obtain identity and expression latent vectors in $\mathbb{R}^{512}$:
$$\mathbf{z}_{id} = \mathrm{FC}_{id}(\mathbf{z}_{joint}), \qquad \mathbf{z}_{exp} = \mathrm{FC}_{exp}(\mathbf{z}_{joint}).$$
Two multi-layer perceptrons consisting of two fully-connected layers with ReLU activations decode the identity and expression blendshapes from their corresponding vectors:
$$\mathbf{S}_{id} = \mathrm{FC}_2(\xi(\mathrm{FC}_1(\mathbf{z}_{id}))), \qquad \Delta\mathbf{S}_{exp} = \mathrm{FC}_2(\xi(\mathrm{FC}_1(\mathbf{z}_{exp}))),$$
with $\xi(x) = \max(0, x)$ the element-wise ReLU non-linearity. Both decoders are symmetric, with $\mathrm{FC}_1(\cdot) : \mathbb{R}^{512} \to \mathbb{R}^{1024}$ and $\mathrm{FC}_2(\cdot) : \mathbb{R}^{1024} \to \mathbb{R}^{3N}$.
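The data flow of the baseline, from the joint embedding to the registered vertices, can be sketched with randomly initialised layers. This is an illustration of the layer shapes only, not the authors' implementation: the PointNet encoder is replaced by a random vector, the number of vertices is reduced from 29495 to keep the example light, the ReLU is assumed to sit between the two decoder layers, and the output dimension 3N is inferred from the stacked vertex coordinates. All function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100  # template vertices (29495 in the text; reduced for this sketch)
D_JOINT, D_LATENT, D_HIDDEN = 1024, 512, 1024


def linear(d_in, d_out):
    """Randomly initialised fully-connected layer, returned as (weights, bias)."""
    return rng.normal(scale=0.01, size=(d_in, d_out)), np.zeros(d_out)


def apply(layer, x):
    W, b = layer
    return x @ W + b


def relu(x):
    return np.maximum(x, 0.0)


# Two parallel FC layers without non-linearity: joint embedding -> latent vectors.
fc_id, fc_exp = linear(D_JOINT, D_LATENT), linear(D_JOINT, D_LATENT)

# Two decoders with the same layout: 512 -> 1024, ReLU, 1024 -> 3N.
dec_id = (linear(D_LATENT, D_HIDDEN), linear(D_HIDDEN, 3 * N))
dec_exp = (linear(D_LATENT, D_HIDDEN), linear(D_HIDDEN, 3 * N))


def decode(decoder, z):
    fc1, fc2 = decoder
    return apply(fc2, relu(apply(fc1, z)))


def register(z_joint):
    """Identity shape plus additive expression deformation, as (N, 3) vertices."""
    z_id, z_exp = apply(fc_id, z_joint), apply(fc_exp, z_joint)
    shape = decode(dec_id, z_id) + decode(dec_exp, z_exp)
    return shape.reshape(N, 3)


z_joint = rng.normal(size=D_JOINT)  # stand-in for the PointNet encoder output
vertices = register(z_joint)
```

With the full template, the last layer of each decoder alone maps 1024 inputs to 3 × 29495 outputs, which gives a sense of how the fully-connected design drives the parameter count.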

Training data
The training data is formed from seven publicly available face datasets of subjects from a wide range of ethnic backgrounds, ages, and genders, as well as a set of synthetic 3D faces. Table 1 summarizes the exact composition of the training set.
Real scans. Both neutral and expressive scans are kept, and the data is unlabeled. The data was processed by first converting the scans to textured meshes using simple processing steps, e.g. Delaunay triangulation of the depth images. Automatic keypoint localization was applied on rendered frontal views of the scans to detect facial landmarks. The 2D landmarks were back-projected on the raw textured mesh using the camera parameters. The cropped BFM template was annotated with matching landmarks, such that Procrustes analysis could be applied to find a similarity transformation to align the raw scan with the template.
Pre-processing. In Liu et al. (2019), the authors applied cropping to remove points outside of the unit sphere originating at the tip of the nose of the subject. The authors also applied mesh subdivision to obtain denser ground-truth meshes, thereby facilitating the sub-sampling of 29495 vertices from scans with insufficient native resolution. Finally, the sampling of points from the scans for training was done at the pre-processing stage. Data augmentation was carried out by randomly sampling vertices from some scans several times and storing the different point clouds separately. Since the synthetic scans are, by nature, in correspondence with the BFM template, Liu et al. (2019) use the element-wise $\ell_1$ norm to train with supervision. For real scans, self-supervised training is carried out to minimize the Chamfer distance between the output $\mathbf{S}$ of the decoder and the potentially subdivided ground-truth scan.
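For reference, a Chamfer distance between two point sets can be computed as below. Several variants exist (squared or unsquared distances, sums or means), and the text does not specify which one the baseline uses; this sketch assumes the mean squared nearest-neighbour distance summed over both directions, with a brute-force pairwise computation that is only suitable for small point sets.

```python
import numpy as np


def chamfer_distance(P, Q):
    """Symmetric Chamfer distance between point sets P (n, 3) and Q (m, 3).

    Assumed variant: mean squared distance to the nearest neighbour,
    summed over both directions.
    """
    # (n, m) matrix of pairwise squared Euclidean distances.
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()


P = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
Q = P + np.array([1.0, 0.0, 0.0])  # same points shifted by one unit along x

d_same = chamfer_distance(P, P)
d_shift = chamfer_distance(P, Q)
d_reordered = chamfer_distance(P[::-1], Q)
```

No correspondence between the two sets is required, and the value does not depend on the ordering of the points, which is what makes the loss usable on unregistered scans.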

Losses and training procedure
Additional losses are used for synthetic and real scans. Edge-length loss is applied to regularize the topology of the reconstruction. For real scans, the edge lengths in the output are regularized towards those of the template. For synthetic scans, the edge-length loss is applied as a function of the difference between the edge lengths of the input and the output meshes. Normal consistency is used for vertex normals. Due to the presence of noise in the raw scans in the mouth region (points from the inside of the mouth, teeth, or tongue), Laplacian regularization is applied to penalize large changes in curvature in a pre-defined mouth region on the BFM template.
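An edge-length regularizer of the kind described above can be sketched as follows. The exact functional form used by the baseline is not given in the text; this example assumes a mean absolute difference between corresponding edge lengths of a predicted mesh and a reference mesh (the template for real scans, the input for synthetic ones). The function names and the toy triangle are hypothetical.

```python
import numpy as np


def edge_lengths(vertices, edges):
    """Euclidean length of each edge; edges is an (E, 2) array of vertex indices."""
    return np.linalg.norm(vertices[edges[:, 0]] - vertices[edges[:, 1]], axis=1)


def edge_length_loss(pred_vertices, ref_vertices, edges):
    """Mean absolute difference between corresponding edge lengths (assumed form)."""
    diff = edge_lengths(pred_vertices, edges) - edge_lengths(ref_vertices, edges)
    return np.abs(diff).mean()


# Toy reference mesh: a single right triangle with unit legs.
ref = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
edges = np.array([[0, 1], [1, 2], [2, 0]])

translated = ref + np.array([5.0, -3.0, 2.0])  # rigid motion: edges unchanged
scaled = 2.0 * ref                             # every edge doubles in length

loss_translated = edge_length_loss(translated, ref, edges)
loss_scaled = edge_length_loss(scaled, ref, edges)
```

The loss ignores rigid motion and only reacts to stretching or compression of the mesh, which is why it constrains the triangulation without fixing vertex positions.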
The autoencoder is trained in successive phases. First, only the identity decoder is trained on the synthetic data only, then on a combination of synthetic and real data. After 10 epochs, the identity decoder and the fully-connected layer of the identity branch of the encoder are frozen (i.e. backpropagation is disabled) and the expression decoder is trained on synthetic data alone, and then on a mixture of synthetic and real data. Finally, both decoders and encoder branches are trained simultaneously on both synthetic and real scans. We refer the reader to the original work for details.

Limitations
We now study the limitations of the approach.

Data processing and representation
Cropping. Although cropping is a simple solution to remove unnecessary parts of the scans, we argue relying on it makes the method less robust. Cropping points outside of the unit sphere centered at the tip of the nose is affected by the quality of the landmark detection. Similarly, choosing the unit sphere centered at the origin of the ambient space will be affected by the location of the scan in $\mathbb{R}^3$. In both cases, even though it is systematic, cropping is inconsistent: as the method is not adaptive, there is no guarantee that the noise (i.e. the points that do not contribute to a better face reconstruction and could even degrade the performance) will be discarded. In particular, for range scans such as those from the FRGC (Phillips et al., 2005), Bosphorus (Savran et al., 2008) and Texas 3D (Gupta et al., 2010) datasets, spikes and irregularities are commonly observed due to sensor noise, as shown in Figure 2. Median filtering has traditionally been applied to the depth images before conversion to 3D surfaces as a means to alleviate this issue (Gupta et al., 2010), but incurs additional human intervention and might cause a loss of details. Cropping would not remove spikes, nor would it discard other irrelevant points if contained within the unit sphere. At the same time, cropping might discard points that would have contributed to the face region.
Subdivision scheme and vertex subsampling. In Liu et al. (2019), mesh subdivision was used to improve the accuracy of the dense correspondences (i.e. provide more ground truth points for the Chamfer loss), and to enable consistent sampling of 29495 vertices for the input point cloud, even from low-resolution face scans that might not have enough remaining vertices in the facial region after cropping (e.g. most scans from the BU-3DFE database (Yin et al., 2006)). The authors then sampled 29495 vertices at random from the (subdivided) mesh to obtain a point cloud.
Subdivision schemes do not introduce additional details in the scan, but create a denser triangulation from existing triangles. The amount of memory required to store the same geometry is thus largely increased. Figure 3 illustrates the refinement step of the Loop scheme used by Liu et al. (2019). Assuming we started with one triangle and applied the scheme twice, the left panel of Figure 3 shows the result after one subdivision step, and the right panel the result after two such steps. We can see that after one step, no vertices were introduced inside of the original triangle: all of the new vertices are located on its edges. After two steps, only 3 vertices have been placed inside the original triangle, yet the number of vertices has been multiplied by 5. In practice, two subdivision steps are the maximum that would be applied due to the rapid increase in memory required to store the subdivided meshes. It is therefore apparent that a point cloud sampled uniformly at random from the vertices of the mesh cannot, in general, yield a uniform coverage of the surface, even after several mesh subdivision steps. Moreover, using the (subdivided) mesh as a ground truth in the Chamfer loss biases the reconstruction: closest points for vertices of the reconstructed mesh will either never be found inside the triangles of the scan, or in an unfavorable ratio when at least two subdivision steps have been applied. Liu et al. (2019) sampled one point cloud per expression scan, and at most ten point clouds per neutral scan, per subject. As this is done during preprocessing, all samples must be stored individually. No other data augmentation or transformation (e.g. jittering) was used.
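The vertex counts quoted above can be checked with a few lines of code. The following is a minimal sketch that only tracks the combinatorics of a 1-to-4 refinement step (one new vertex per edge, four faces per triangle), starting from a single isolated triangle; the function names are ours and are not taken from any implementation of Liu et al. (2019).

```python
def subdivide_counts(v, e, f):
    """One 1-to-4 refinement step: returns the new (vertices, edges, faces)."""
    # One new vertex per edge; each edge splits in two and each face adds 3 inner edges.
    return v + e, 2 * e + 3 * f, 4 * f


def single_triangle_stats(steps):
    """Vertex statistics after `steps` refinements of one isolated triangle."""
    v, e, f = 3, 3, 1
    for _ in range(steps):
        v, e, f = subdivide_counts(v, e, f)
    boundary = 3 * 2 ** steps  # each original edge carries 2**steps segments
    return {"vertices": v, "interior": v - boundary, "faces": f}


for k in (1, 2):
    print(k, single_triangle_stats(k))
```

Under these assumptions, one step yields 6 vertices with none in the interior, and two steps yield 15 vertices (5 times the original 3) of which only 3 are interior, which is why sampling vertices leaves the inside of the original triangles sparsely covered.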

Point cloud representation
To avoid overfitting to a particular sampling of a given surface, we argue that as many different point clouds as possible should be presented to the model for each mesh.

Architectural limitations and conclusion
We review the limitations of the two main blocks of the algorithm of Liu et al. (2019), and conclude the section.
Decoder While MLP decoders are powerful and fully capable of representing details, they do not take advantage of the known template topology. In fact, careful tuning is required to obtain topologically sound shapes: Liu et al. (2019) rely on a strong edge length prior, and use synthetic data extensively during training to condition both the encoders and decoders to respect the topology of the template.
We observe significant artifacts for a large portion of the input scans, as shown in Figure 4. Notably, we observe tearing-like artifacts and self-intersecting edges, as well as excessive roughness and ragged edges at the boundaries of the shape. In particular, heavy artifacting is present in the mouth region despite the use of the Laplacian loss. Such registrations cannot be exploited for downstream tasks (such as learning from or statistical analysis on the registered scans) without heavy post-processing to correct the artifacts and improve surface fairness.
Encoder A vanilla PointNet (Qi et al., 2017a) layer consists of a 1 × 1 convolution, followed by batch normalization and a ReLU activation, as shown in Figure 5a. Choosing $N_s = N$ facilitates mixed batching of synthetic and real scans, but according to Liu et al. (2019), the optimal batch size for the model was found experimentally to be 1. As batch normalization is known to result in degraded performance for small batch sizes (Wu and He, 2020), we investigate possible improvements.
Number of parameters While the PointNet encoder used in Liu et al. (2019) enables a high degree of weight sharing, the decoders rely on dense fully-connected layers. This design choice results in a high number of parameters (183.6M), which, combined with the limited data augmentation and absence of regularization, promotes overfitting.
Conclusion The reliance on subdivision and cropping, the high number of trainable parameters, as well as the training methodology utilised, make the method of Liu et al. (2019) only suitable for in-sample registration, and thus the fast inference time does not fully offset the offline training time.
The presence of significant noise and artifacts on registrations of scans from the training set further limits the applicability of the model on its own.

Description of the Method
We now introduce Shape My Face, our registration and morphable model pipeline. Our approach is based on the idea that registration can be cast as a translation problem, where one seeks to faithfully translate latent geometric information (the surface) from an arbitrary input modality to a controlled template mesh. It is therefore natural to adopt an autoencoder architecture, with the advantages discussed in Section 3. We also wish to ensure our model is compact and performs reliably and satisfactorily on unseen data. The emphasis is, therefore, on robustness and applicability to real-world data, potentially on the edge.
Figure 6: [...] of the reconstructions to be close in position and topology to that of the LSFM mean face. We propose a parameter-free approach for achieving high quality mouth reconstructions by reconstructing a crop of the mouth region on a small mouth-specific PCA model, and blending the reconstruction with the shapes predicted by the decoders using a smooth blending mask derived from the geodesic distance of the vertices in the template to a small crop of the lips (c).

Preliminaries and Stochastic Training
We choose the mean face of the LSFM model to be our template. We manually cropped the same facial region as the template of Liu et al. (2019) from a full-face combined LSFM and FaceWarehouse morphable model, and ensured a 1-to-1 correspondence between vertices. We choose LSFM since it is more representative of the mean human face than the BFM 2009 mean, and to facilitate the prototyping of a mouth model, as explained in Section 4.4. We adopt a formulation in terms of blendshapes and define the output of our network to be
$$S = \mu + \Delta_{id} + \Delta_{exp},$$
where $\mu$ is the template mean face shown in Figure 6a, and $\Delta_{id}$ and $\Delta_{exp}$ are identity and expression deformation fields, respectively, defined on the topology of $\mu$. We make this choice to encourage better disentanglement by modeling both identity and expression as additive deformations of a plausible mean human face.
Figure 7: Flow-chart representation of our approach: We sample $2^{16}$ points uniformly at random on the surface of the scan to register. A modified PointNet encoder predicts per-point features as well as an attention mask. We apply the mask element-wise on the point features before building the global latent representation $z_{joint}$ of the shape. From $z_{joint}$, two hyperspherical embeddings $z_{id}$ and $z_{exp}$ are produced. We decode identity and expression blendshapes separately using mesh convolution decoders with novel skip connections. To ensure noise-free reconstructions of the mouth region, we project the mouth region in the blendshape to a specialized PCA model of the mouth. The PCA reconstruction is blended smoothly with the output of the mesh convolution using a blending mask based on the geodesic distance to the inside of the mouth. Finally, the two blendshapes are summed with our template mean face to obtain the registration. In training (dotted lines), we measure the fit of the registration between the output of the network and the dynamically sampled input point cloud. This ensures vertices of the reconstruction can be matched to points anywhere on the surface of the scan, and not only to the vertices.
We follow an encoder-decoder architecture using a point cloud encoder and two symmetric non-linear decoders for the identity and expression blendshapes. As we will develop further, we propose a novel approach to avoid mouth artifacts by blending the non-linear blendshapes smoothly with linear blendshapes of the mouth region (defined based on the geodesic radius from the inside of the mouth). The flowchart of the method is presented in Figure 7.
Contrary to Liu et al. (2019), we do not apply any further processing on the 3D scans after rigid alignment. In particular, no surface subdivision and no offline sampling for data augmentation are done. For both training and inference, we only assume we are given the raw scans as meshes (a point cloud is enough for inference, as shown in Section 6.4, though any input modality from which we can sample points on the surface would be suitable for inference, and also for training with the addition of point normals), and that these scans are rigidly aligned (with scaling) with the template.
We dynamically sample $N_s = 2^{16} = 65536$ points uniformly at random on the surface of the input mesh using a triangle weighting scheme. Furthermore, we use the sampled point cloud as ground truth in the Chamfer loss. This ensures the vertices of the registration can be matched to points anywhere on the input surface, including inside triangles where the true projections of the vertices of the registration are more likely to lie.
We denote the triangulated raw input scan by the tuple $(S_{in}, T_{in})$, where $S_{in}$ is the set of vertices of the mesh, and $T_{in}$ the triangles. We write $P_{in}$ for the point cloud dynamically sampled on the surface of $(S_{in}, T_{in})$, and $N_{in}$ for the associated sampled point normals.
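As an illustration of sampling points uniformly on a triangulated surface, the sketch below picks triangles with probability proportional to their area and then draws uniform barycentric coordinates inside each chosen triangle. It is a pure-Python sketch under our own assumptions (function names, the square-root parameterization of the barycentric coordinates, no point normals), not the implementation used for the experiments.

```python
import math
import random


def triangle_area(a, b, c):
    """Area of the 3D triangle (a, b, c) from the norm of the cross product."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    cross = (u[1] * v[2] - u[2] * v[1],
             u[2] * v[0] - u[0] * v[2],
             u[0] * v[1] - u[1] * v[0])
    return 0.5 * math.sqrt(sum(x * x for x in cross))


def sample_surface(vertices, triangles, n_points, rng=random):
    """Draw n_points uniformly at random on the surface of a triangle mesh."""
    areas = [triangle_area(*(vertices[i] for i in tri)) for tri in triangles]
    # Pick triangles with probability proportional to their area.
    chosen = rng.choices(range(len(triangles)), weights=areas, k=n_points)
    points = []
    for t in chosen:
        a, b, c = (vertices[i] for i in triangles[t])
        # Square-root trick: uniform barycentric coordinates inside the triangle.
        r1, r2 = math.sqrt(rng.random()), rng.random()
        w = (1.0 - r1, r1 * (1.0 - r2), r1 * r2)
        points.append(tuple(w[0] * a[k] + w[1] * b[k] + w[2] * c[k] for k in range(3)))
    return points


# Toy mesh: a unit square in the z = 0 plane made of two triangles.
square_vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
square_triangles = [(0, 1, 2), (0, 2, 3)]
cloud = sample_surface(square_vertices, square_triangles, 1000, random.Random(0))
print(len(cloud), cloud[0])
```

Because a new point cloud is drawn at every call, the same mesh yields a different sample each time it is presented to the model, and sampled points fall inside triangles as well as near vertices.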
We use both synthetic and real scans in training. The training procedure is detailed in Section 4.6.

Encoder and attention
In PointNet (Qi et al., 2017a), the authors introduce one of the first CNN architectures for point clouds. A PointNet layer consists of a 1 × 1 convolution followed by batch normalization and a ReLU activation, as shown in Figure 5a. PointNet showed high performance on classification and segmentation tasks using moderately dense point clouds as input (2048 points for the ModelNet40 meshes). In this work, we sample $2^{16} = 65536$ points from the input scans, which limits the batch sizes that can be accommodated with a single GPU implementation. As mentioned in Section 3.4.2, batch normalization is known to be ineffective for small batches (Wu and He, 2020), as the sample estimators of the feature mean and standard deviation become noisy. We therefore propose modified PointNet layers with group normalization (Wu and He, 2020), which we choose to apply after the ReLU non-linearity. Our modified PointNet layers are illustrated in Figure 5b. We denote by PN($f_{in}$, $f_{out}$, $g$) the block consisting of a 1 × 1 convolution with $f_{in}$ input features and $f_{out}$ output features, followed by one ReLU activation, and group normalization with group size $g$. The sequence of point convolutional layers in our encoder can thus be written as a chain of PN blocks: [...]
Visual attention To improve the robustness of our method to noise and variations in the physical extent of the scans, we introduce a novel visual attention mechanism implemented as a binary-classification PointNet sub-network applied to the features of the last PointNet layer and before the max-pooling operation. This can be seen as a form of region-proposal (He et al., 2017) or segmentation sub-network followed by a gating mechanism. We use our modified PointNet layers and obtain the following sequence of operations: PN(1024, 128, 4) → PN(128, 32, 4) → Conv1×1(32, 1). We use a smaller group size of 4 for group normalization to discourage excessive correlation in the features.
The logits obtained as output of the attention sub-network are converted to a smooth mask by applying the sigmoid function and used as gating values to the max pooling operation, controlling which points are used to build the global latent representation $z_{joint} \in \mathbb{R}^{1024}$ for the scan.
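The gating step can be illustrated on a toy example. The sketch below applies a sigmoid to per-point logits, multiplies each point's features by its mask value, and takes the channel-wise maximum over points. Names, shapes, and the non-negative toy features are assumptions made for illustration only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_max_pool(features, logits):
    """features: N per-point feature lists; logits: N attention logits.

    Returns the global feature (channel-wise max of the masked features) and the mask.
    """
    mask = [sigmoid(l) for l in logits]
    gated = [[m * f for f in row] for m, row in zip(mask, features)]
    pooled = [max(channel) for channel in zip(*gated)]
    return pooled, mask


# Three points with two channels; the second point is an outlier with large features.
features = [[0.2, 0.5], [9.0, 9.0], [0.4, 0.1]]
logits = [6.0, -12.0, 6.0]
z_joint, mask = gated_max_pool(features, logits)
print(z_joint, mask)
```

In this toy case the outlier receives a mask value close to 0, so it no longer dominates the max pooling and the global feature is built from the two remaining points.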
Hyperspherical embeddings Two dense layers predict separate identity and expression embeddings from $z_{joint}$. We choose $z_{id}, z_{exp} \in \mathbb{R}^{256}$. Contrary to Liu et al. (2019), the mapping is non-linear: we normalize the identity and expression vectors, such that they lie on the hypersphere $S^{255}$. Hyperspherical embeddings have been successful in image-based face recognition (Wang et al., 2018; Deng et al., 2019) and shown to improve clusterability (Aytekin et al., 2018). Additionally, we found the normalization to improve numerical stability during training.
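The normalization onto the hypersphere amounts to dividing each embedding by its Euclidean norm. A minimal sketch follows; the function name and the small `eps` guard against division by zero are our own additions.

```python
import math


def to_hypersphere(v, eps=1e-12):
    """Project a vector onto the unit hypersphere by L2 normalization."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]


z_id = to_hypersphere([3.0, 0.0, 4.0])
print(z_id, sum(x * x for x in z_id))
```

After this step only the direction of the embedding carries information, which bounds the magnitude of the decoder inputs.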
The full encoder can be summarized as follows: [...] where $\odot$ denotes the element-wise (Hadamard) product.

Figure 8: One Mesh Inception block (diagram labels: Up, Conv, ELU, Concat, FC, ELU). Our mesh convolution block offers two paths for the information to flow from one resolution to the next. We concatenate the activated feature map of the current convolution layer with the upsampled feature map of the previous layer. The features are combined in a learnable way by a fully connected layer followed by another ELU activation.

Mesh convolution decoders
As developed in Section 3.4.2, the fully-connected decoders used in Liu et al. (2019) suffer from two main challenges. First, they employ a high number of parameters, which promotes overfitting. Second, they do not leverage the known template topology, and therefore require heavy tuning and regularization to produce topologically sound shapes, and are highly prone to failure.
We propose non-linear decoders based on mesh convolutions. Our method is applicable to any intrinsic convolution operator on meshes. In this particular implementation, we use the SpiralNet++ operator presented in Equation 1. We observed training was difficult with the vanilla operators. As some operators such as SpiralNet++ and ChebNet already have a form of residual connections built-in (the independent weights given to the center vertex of the neighborhood), vanilla residuals or the recently-proposed affine skip connections (Gong et al., 2020) would be redundant. We instead propose a block reminiscent of the inception block in images (Szegedy et al., 2015) that can benefit any graph convolution operator. We concatenate the output of the previous upsampled feature map with the output of the convolution after an ELU non-linearity (Clevert et al., 2016). The concatenated feature maps are combined and transformed to the desired output dimension using an FC layer followed by another ELU non-linearity, as illustrated in Figure 8.
We found this technique to drastically improve convergence and details in the reconstructed shapes. The technique is comparable to the successful GraphSAGE (Hamilton et al., 2017) algorithm, using graph convolutions followed by ELU as the aggregation function, and ELU non-linearities. We refer to our block as Mesh Inception.
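The data flow of the block can be sketched as follows. The mesh convolution is passed in as an arbitrary callable, since the block is meant to wrap any operator; the stand-in convolution, the weights, and all names below are hypothetical and chosen only to make the example runnable.

```python
import math


def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def linear(rows, weight, bias):
    """Apply y = W x + b to every row (one row of features per vertex)."""
    return [[sum(w * x for w, x in zip(w_row, row)) + b
             for w_row, b in zip(weight, bias)] for row in rows]


def mesh_inception(x_up, conv, fc_weight, fc_bias):
    """x_up: upsampled features of the previous layer, one row per vertex.

    conv: any mesh convolution, given as a callable applied to x_up.
    """
    h = [[elu(v) for v in row] for row in conv(x_up)]  # convolution path, then ELU
    cat = [a + b for a, b in zip(h, x_up)]             # concatenate with the skip path
    out = linear(cat, fc_weight, fc_bias)              # learnable mixing (FC layer)
    return [[elu(v) for v in row] for row in out]      # final ELU


# Stand-in "convolution": a per-vertex linear map from 2 features to 1 feature.
toy_conv = lambda x: linear(x, [[1.0, -1.0]], [0.0])
x_up = [[1.0, 2.0], [0.5, 0.0]]
y = mesh_inception(x_up, toy_conv, [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]], [0.0, 0.0])
print(y)
```

The FC layer sees both the convolved and the unconvolved features of each vertex, so it can learn how much of each path to pass on to the next resolution.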
For upsampling, we follow the approach of Ranjan et al. (2018). We decimate the template four times using the Qslim method (Garland and Heckbert, 1997) and build sparse upsampling matrices using barycentric coordinates. We set the kernel sizes of our convolution layers to 32, 16, 8, and 4, starting from the coarsest decimation of the template.

Mouth model and blending
Though the raw scans are rigidly aligned with the template on 5 facial landmarks that include the two corners of the mouth, the mouth expressions introduce a high level of variability in the position of the lips. Additionally, numerous expressive scans include points captured from the tongue, the teeth, or the inside of the mouth. This noise and variability in the dataset makes finding good correspondences for the mouth region difficult and leads to severe artifacting in the form of vertices from the lips being pulled towards the center of the mouth. In Liu et al. (2019), the authors advocate for the use of Laplacian regularization to prevent extreme deformations by penalizing the average mean curvature over a pre-defined mouth region, controlled by a weight $\lambda_{Lap}$. While this shows some success, we experimentally observed that, for small to moderate values of $\lambda_{Lap}$, artifacts remained. As shown in Figure 9, while artifacts were reduced for large values of $\lambda_{Lap}$, so was the range of expressions.
In this work, we introduce a new approach based on blending a specialized linear morphable model with the nonlinear face model. We first isolate a small set of vertices, $S_{inner}$, from the innermost part of the lips of the cropped LSFM mean face, as shown in Figure 6c. We then compute the geodesic distance from $S_{inner}$ to all vertices of the template using the heat method with intrinsic Delaunay triangulation (Crane et al., 2017), which is visualised in Figure 10a. We redefine the mouth region to be the set of vertices $S_{mouth}$ within a given geodesic radius $d$ from $S_{inner}$. By visual inspection, we choose $d = 0.15$. The resulting mouth region is shown as a point cloud in Figure 10c.
Figure 10: [...] Figure 6c, we compute the geodesic distance of all vertices of the template to the vertices of the crop $S_{inner}$ (a). We define the mouth region as the vertices within a chosen geodesic radius of $S_{inner}$ (c). We define the blending mask as a function of the geodesic distance, shown as a heatmap in (b).
To obtain a linear morphable model of this mouth region, we cropped the PCA components of the full face LSFM and FaceWarehouse model whose mean we used to obtain our face template. We keep only a subset, $W_{id}$, of 30 identity components (from LSFM) and a subset, $W_{exp}$, of 20 expression components (from FaceWarehouse). While it is well known that computing PCA on the cropped region of the raw data leads to more compact bases (Blanz and Vetter, 1999; Tena et al., 2011), re-using the LSFM and FaceWarehouse bases enabled efficient prototyping. There is a trade-off between representation power and clean noise-free reconstructions: the model needs to be powerful enough to represent a wide range of expressions but restrictive enough that it does not represent the unnatural artifacts.
We project the mouth region of the blendshapes on the PCA mouth model during training and blend them smoothly with their respective source blendshapes, i.e., we project the mouth region of $S_{id}$ on $W_{id}$ and the mouth region of $S_{exp}$ on $W_{exp}$. Blending should be seamless, but, equally importantly, should also remove artifacts. We propose to define a blending mask intrinsically as a Gaussian kernel of the geodesic distance $r$ from $S_{inner}$:
$$b(r, c, \tau) = \begin{cases} 1 & \text{if } r \le c, \\ \exp\left(-\left(\dfrac{r - c}{\tau}\right)^2\right) & \text{otherwise,} \end{cases}$$
where $c$ and $\tau$ control the geodesic radius for which the PCA model is given a weight of 1, and the rate of decay, respectively. Compared to exponential decay, the squared ratio $((r - c)/\tau)^2$ allows us to favor more strongly the PCA model when $r - c \le \tau$ and decay faster for $r - c > \tau$. Enforcing weights of 1 within a certain radius helps ensure the artifacts are entirely removed. The mouth region of the blendshape $S_{(\cdot)}$ is redefined as:
$$S_{(\cdot),mouth} = M \odot \left(P_{(\cdot)} Y_{(\cdot),mouth}\right) + (1 - M) \odot Y_{(\cdot),mouth},$$
with $M$ the blending mask (applied element-wise, denoted $\odot$), $Y_{(\cdot),mouth}$ the mouth region in the output of the mesh convolutions, and $P_{(\cdot)}$ the projection matrix on the matching PCA basis. We choose $c$ experimentally. As $c$ varies, we adapt $\tau$ to ensure the contribution of the PCA model to the reconstruction of the mouth region is low at the edges of the crop, and avoid seams. For a desired weight $\varepsilon \ll 1$ at distance $r$ and given $c$, we compute
$$\tau = \frac{r - c}{\sqrt{-\ln \varepsilon}}.$$
In practice, we choose $c$ = 3.5e-2 and $\varepsilon$ = 5e-4. We plot the resulting $b(\cdot, c, \tau)$ in Figure 11. In this work, we fixed $c$ and $\tau$ for all shapes, on the assumption that the geodesic distance from the inner lips does not vary excessively in the dataset. However, it is perfectly reasonable to consider both parameters to be trainable, or to predict them from the latent vectors $z_{joint}$, $z_{id}$ or $z_{exp}$ to obtain shape or blendshape-specific blending masks.
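The blending mask and the choice of the decay rate can be made concrete with a short numerical sketch. It assumes the mask equals 1 up to radius c and then decays as a Gaussian in (r − c)/τ, and it solves for τ so that the weight equals ε at a target distance; using the mouth radius d = 0.15 as that target distance is our assumption, as are all function names.

```python
import math


def blend_weight(r, c, tau):
    """Weight of the PCA mouth model at geodesic distance r from the inner lips."""
    if r <= c:
        return 1.0
    return math.exp(-((r - c) / tau) ** 2)


def tau_for(r, c, eps):
    """Decay rate such that blend_weight(r, c, tau) equals eps for a given r > c."""
    return (r - c) / math.sqrt(-math.log(eps))


def blend(y, y_pca, w):
    """Convex combination of the mesh-convolution output y and its PCA projection."""
    return [w * p + (1.0 - w) * q for p, q in zip(y_pca, y)]


c, eps, d = 3.5e-2, 5e-4, 0.15  # d as the target distance is an assumption
tau = tau_for(d, c, eps)
print(tau, blend_weight(d, c, tau))
for r in (0.0, 0.05, 0.10, 0.15):
    print(r, blend_weight(r, c, tau))
```

With these values the PCA reconstruction fully replaces the decoder output near the inner lips, while its contribution becomes negligible at the edge of the mouth crop, which is what avoids a visible seam.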

Losses
For synthetic scans, we define [...]
For real scans, we use the Chamfer distance [...]
As in Liu et al. (2019), we discard a point from the error if $\min_{q \in P_{in}} \|p - q\|_2^2 > \sigma$ or $\min_{p \in S} \|p - q\|_2^2 > \sigma$. We set $\sigma$ = 5e-4.
For synthetic scans, we let $n(p)$ be the normal vector at vertex $p \in S$, and $n_{in}(p)$ be the normal in the synthetic scan, and define the normal loss as: [...]
For real scans, we use [...] where $q$ is the closest point in $P_{in}$ found by Eq. 19. In both cases, we set a weight of $\lambda_{norm}$ = 1e-4. Mesh convolutions are aware of the template topology and do not require as much topological regularization as MLPs; we therefore use a weight of $\lambda_{edge}$ = 5e-5 for the edge loss, whose formulation is identical to that of Liu et al. (2019).
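To illustrate the kind of Chamfer term with outlier rejection described here, the following sketch computes a symmetric nearest-neighbour squared distance between the registration vertices and the sampled point cloud, ignoring any point whose nearest neighbour is farther than σ in squared distance. Averaging over the retained points and the brute-force search are our assumptions; the exact normalization used in the paper is not recoverable from the text, and a practical implementation would use a spatial index or a GPU kernel.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def one_sided(src, dst, sigma):
    """Mean squared distance from src to its nearest neighbours in dst,
    ignoring points whose nearest neighbour is farther than sigma (squared)."""
    nearest = (min(sq_dist(p, q) for q in dst) for p in src)
    kept = [d for d in nearest if d <= sigma]
    return sum(kept) / len(kept) if kept else 0.0


def chamfer(registration, scan_points, sigma=5e-4):
    """Symmetric Chamfer distance with outlier rejection (brute force)."""
    return (one_sided(registration, scan_points, sigma)
            + one_sided(scan_points, registration, sigma))


reg = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
scan = [(0.0, 0.0, 0.01), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]  # last point is a far outlier
print(chamfer(reg, scan), chamfer(reg, scan, sigma=float("inf")))
```

In the toy example the far outlier is dropped by the threshold, whereas without the threshold it dominates the loss.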
To regularize the attention mechanism during the initial supervised training steps, we assume all points sampled from the synthetic faces are equally important and none should be removed. We encourage the attention mask for the points sampled from synthetic scans to be 1 everywhere, using the binary cross entropy loss with a weight $\lambda_{att}$ = 1e-4.
Finally, we enforce both an edge loss and an $\ell_1$ loss regularization between the reconstruction and the template in a small crop of the boundary, shown in Figure 6b, to eliminate tearing artifacts. We let $\lambda_{bnd}$ = 1e-3.

Training, models, and implementation details
Training data As previously explained, we use the same raw aligned data as the baseline model of Liu et al. (2019), but do not apply any further pre-processing, including data augmentation. To keep the ratio of identity and expression scans identical, we simply sample from the same scan as many times as required in a given training epoch.
In addition to the seven datasets of Table 1, we further add two large-scale databases of 3D human facial scans. The MeIn3D (Booth et al., 2017, 2018a; Bouritsas et al., 2019) database contains 9647 neutral face scans of people of diverse age and ethnic background. We also select 17750 scans from the 4DFAB (Cheng et al., 2018) database. 4DFAB contains neutral and expressive scans of 180 subjects captured in 4 sessions spanning a period of 5 years. Each session comprises up to 7 tasks, consisting of either utterances, voluntary, or spontaneous expressions.
For a given subject in the 4DFAB database, we select the first frame of all sequences in the first two tasks as neutral scans. We select the middle frame of every sequence of the first two tasks as expressive scans for the six basic expressions (happy, sad, surprised, angry, disgust, and fear) and utterances. For tasks 3, 4, and 5, we select the frames at 1/3 and 2/3 of the sequence. For task 6, we select the frames at 1/3 and 2/3 of the sequence for the first two sessions, and the middle frame otherwise. We pick the middle frame for all other sequences.
In this work, we evaluate two models. We call SMF our model trained on the same dataset as the baseline. Our model trained with the addition of the MeIn3D and 4DFAB datasets is denoted by SMF+. The breakdown of the dataset for SMF+ is presented in Table 2.
Training procedure The BFM 2009 model was trained on a sample size of 200 subjects, and offers a limited representation of the diversity of human facial anatomy. We found the synthetic data to hinder the performance of the model, and to limit the realistic nature of the reconstructions. Since mesh convolution operators learn to represent signals on the desired template topology, we drastically reduce the reliance on synthetic data, restricting it to the very first stages of training to condition the attention mechanism.
We first train the encoder and the identity decoder on synthetic data only for 5 epochs; and then on real neutral scans only for a further 10 epochs. We repeat this procedure for the expression decoder by freezing the identity decoder and the identity branch of the encoder, using only expressive scans. We then train both decoders jointly and the encoder for 10 epochs on the entire set of real scans. Finally, we change the batch size to 1 and refine the complete model for 15 epochs on the entire set of real scans.
We set the initial batch size to 2 and 8 for SMF and SMF+, respectively. We train the models with the Adam optimizer (Kingma and Ba, 2014), with a learning rate of 1e-4, and automatically decay the learning rate by a factor of 0.5 every 5 epochs. No additional regularization is used.
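The decay rule corresponds to a standard step schedule. A minimal sketch follows; whether the epoch counter is reset between training stages is not specified here, so the function simply takes the epoch index as an argument.

```python
def learning_rate(epoch, base_lr=1e-4, factor=0.5, step=5):
    """Step decay: multiply the learning rate by `factor` every `step` epochs."""
    return base_lr * factor ** (epoch // step)


for epoch in (0, 4, 5, 10):
    print(epoch, learning_rate(epoch))
```

In PyTorch, the same behaviour is available through `torch.optim.lr_scheduler.StepLR` with `step_size=5` and `gamma=0.5`.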
Software implementation and hardware Our model is implemented with PyTorch. We use the CGAL library for the computation of the geodesic distance using the heat method (Crane et al., 2020), implemented in C++ as a PyTorch extension. We render figures using the Mitsuba 2 renderer (Nimier-David et al., 2019). All models were trained on a single Nvidia TITAN RTX, in a desktop workstation with an AMD Threadripper 2950X CPU and 128GB of DDR4 2133MHz memory.
Side by side comparison We summarize the differences between SMF and the baseline in Table 3.

Experimental evaluation: Registration
We first evaluate SMF and SMF+ on surface registration tasks. In addition to the original data from Liu et al. (2019), we test the generalisability of our method on a previously unseen dataset, 3DMD. 3DMD is a high resolution dataset containing in excess of 24,000 scans captured from more than 3,000 individuals. The dataset contains subjects from a wide range of ethnicities and age groups, each expressing a variety of facial expressions including neutral, happy, sad, angry and surprised.
We report the median and standard deviation of the error within each group. Table 4 summarizes the results for the BU-3DFE dataset, and Table 5 the results on 3DMD. We note that applying NICP did not significantly change the landmark error, which is likely due to the reconstructions output by SMF and SMF+ being already sufficiently close to the ground truth surface. There is, however, an advantage in using SMF to initialize NICP compared to landmarks: the typical runtime of the public implementation of NICP we used (from the publicly available LSFM code (Booth et al., 2018a)) with landmarks initialization was between 45 and 60 seconds per scan, while the initialization with SMF achieved equally detailed registrations in around 20 seconds per scan.

Surface error
While a low landmark localization error suggests key facial points are faithfully placed in the registration, it does not paint the whole picture and does not indicate the general reconstruction fidelity. In particular, it is affected by the inevitable imprecision of manual labeling, and the error is measured on a small number of points.
To further assess the performance of the models, we measure the surface reconstruction error between the registrations and the ground-truth raw scans. We randomly select a sample of 5000 training scans and a sample of 5000 test scans (from the 3DMD dataset) and measure the distance of the vertices of the registrations to their closest point on the ground-truth surface. We summarize each scan by its median surface error.
Training and test set We visualize the error distribution on the subsets of the training and test sets in Figure 12. Figure 12a provides box plots of the distribution of the median surface error per scan for SMF and SMF+. On both the training and test sets, the models exhibit typically low error, with a pronounced skew of the median towards lower values (0.269mm for SMF and 0.252mm for SMF+), and few far outliers. On the training set, the median (per scan) error distributions of SMF and SMF+ appear very similar, with a slight advantage to SMF+. On the test set, however, SMF+ displays significantly lower values at the quartiles and a tighter distribution, suggesting the addition of the MeIn3D and 4DFAB datasets was effective in reducing the generalization gap and the variance of the model.
Figure 12: [...] We split the plots to help compare the error distribution on the two datasets. Vertical dotted lines represent the quartiles of the distribution.

BU-3DFE and 3DMD
To complete the evaluation on BU-3DFE and 3DMD, we produce in Figure 13 the cumulative distribution function (CDF) of the surface error for the entire BU-3DFE dataset, and for the aforementioned sample of 5000 test scans, for the same models as in Section 5.1. To help visualize the counts of extreme values, we provide a rug plot for SMF evaluated out of sample. As evidenced by the plots, SMF and SMF+ performed very similarly in sample, while SMF trained without BU-3DFE had slightly lower performance. The baseline model, on the other hand, had a significantly worse error distribution. Table 6 provides numerical values for the 25%, 50%, 75%, and 99% quantiles for the models we plotted. We omit the baseline evaluated out of sample on BU-3DFE due to the very high landmark localization error. On our separate test set, a similar development unfolds.
Figure 13: Cumulative error: Cumulative distribution function for the median (per scan) error on the training and test sets for the models evaluated for the semantic landmark accuracy experiment (panel: median surface error on 3DMD; curves: baseline out of sample, SMF out of sample, SMF+ out of sample). Even though the semantic landmark error of the baseline was not atypical, the distribution of the surface error reveals that the registration accuracy is actually much lower than that of SMF and SMF+. The rug plot (red bars at the bottom) visualizes the distribution of the samples in terms of median error for SMF evaluated out of sample. On 5000 sample test scans, few outliers had high median surface error. SMF+ performs comparably with SMF in sample but has distinctly lower generalization error.
The difference between the error distributions of SMF+ and SMF is small but significant, with SMF+ outperforming the model trained on less data. Out of sample, the baseline model's performance is significantly degraded, with the bottom 25% of the surface error already reaching 2.50mm.
Figure 14: Sample reconstructions on the training set for SMF, arranged in two columns. From left to right: raw scan, output of the baseline, point cloud sampled on the scan by SMF, predicted attention mask, output of SMF, and surface reconstruction error visualized as a texture on the output of SMF. We can see SMF markedly outperforms the baseline and provides accurate natural-looking reconstructions with uniformly low error in the facial region and accurate representation of both identity and expression.
Figure 15: Sample reconstructions on the test set for SMF, arranged in two columns. From left to right: raw scan, output of the baseline, point cloud sampled on the scan, predicted attention mask, output of SMF, and surface reconstruction error visualized as a texture on the output of SMF. The test reconstructions look comparable to the training reconstructions for SMF, with high quality registrations across gender, age and ethnicity, even for extreme facial expressions.
Figure 17: Per-vertex distance to the mean prediction: We sample 100 different point clouds for 1000 training and test scans and compute, for each vertex in each registration, its median Euclidean distance to the matching vertex in the average reconstruction. We present histograms of the max and median values (across vertices) per scan to show our method is stable to resampling the same input surface.
Training and test reconstructions We visualize sample reconstructions from SMF on the training and test sets. For each scan, we render the input point cloud sampled on the mesh, and the attention score predicted by SMF for every point as a heatmap, with bright green denoting attention scores close to 1, and black denoting attention scores close to 0. We also render the reconstruction produced by SMF, and the heatmap of the surface error as a texture on the registration. We render the reconstruction produced by the baseline for comparison. Figure 14 provides visualizations for 18 training samples arranged in two columns. Figure 15 shows the comparative performance of the baseline and our model for 12 test scans arranged in two columns. We show sample reconstructions from SMF+ on MeIn3D and 4DFAB in Figure 16.
Visual inspection correlates strongly with the numerical evaluation. Our SMF model consistently produces registrations that are smooth and detailed, with very low surface error. The attention mechanism appears to successfully segment the face, eliminating gross corruption, and discarding points from the tongue and teeth for several scans. Our model faithfully represents both the identity and expression, even for extreme expressions on the test set.
In particular, factors such as age, ethnicity, and gender are accurately captured. Non-linear deformations of the nose, cheeks, and mouth are well preserved across a wide range of identities and expressions. Finally, despite the inclusion of points from the teeth and tongue in the raw scans, SMF produces artifact-free and expressive mouth reconstructions with seamless blending in the vast majority of cases.
Figure 18: Attention mask: Attention mask for two point clouds sampled from the same test shape (3DMD). It can be seen that the attention mechanism excludes the points inside of the mouth and outside of the face area. The mask is also stable to resampling of the scan.

Robustness
First, given the stochastic nature of the method, we evaluate the stability of the reconstructions to resampling of the input scans. We then focus on evaluating the attention mechanism.
Stability to resampling. We select a subset of 1000 scans from each of the training and test sets and produce 100 reconstructions of each scan with SMF, randomly sampling a new point cloud on the surface of the scan at each iteration. For each scan, we compute the mean reconstruction. For each point of the 100 reconstructions, we compute its Euclidean distance to the matching point in the mean reconstruction for that scan. We then take the median and max of these distances for every point in the scan and compute their median across the scan, denoted by "median median" and "median max", as indications of the typical-case and typical worst-case variations. We collect both values for each of the 1000 training and 1000 test scans, and plot their histograms in Figure 17. The results show our method is stable to resampling: the median median variations, in particular, are concentrated below 0.1mm, with a typical maximum variation in position from the mean below 0.2mm per vertex. Interestingly, we observe less spread on the test set than on the training set, but slightly higher typical maximum displacement per vertex, still below 0.2mm per vertex. Figure 18 illustrates that the attention mechanism is also stable.
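The two stability statistics described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `resampling_stability` and the synthetic data are assumptions made for the example, and the array of reconstructions stands in for the output of SMF on repeated resamplings of one scan.

```python
import numpy as np

def resampling_stability(recons):
    """Stability statistics for ONE scan.

    recons: array of shape (n_resamples, n_vertices, 3) holding the
    registrations obtained from different point clouds of the same scan.
    Returns ("median median", "median max") as described in the text.
    """
    mean_recon = recons.mean(axis=0)                      # (V, 3) mean reconstruction
    dists = np.linalg.norm(recons - mean_recon, axis=-1)  # (R, V) distance to the mean
    per_vertex_median = np.median(dists, axis=0)          # typical deviation per vertex
    per_vertex_max = dists.max(axis=0)                    # worst deviation per vertex
    return np.median(per_vertex_median), np.median(per_vertex_max)

# Toy data (hypothetical): 100 noisy copies of a random 500-vertex shape.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))
recons = base + 0.01 * rng.normal(size=(100, 500, 3))
median_median, median_max = resampling_stability(recons)
```

Collecting both values over many scans and plotting their histograms would give the kind of summary reported in Figure 17.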

Ablation study on the encoder
We now evaluate the contribution of the improvements we made to the PointNet encoder (attention mechanism, group normalization) to the quality of the registration by carrying out an ablation study. We train SMF with our modified PointNet encoder without attention (No att.) and with the vanilla PointNet encoder used in Liu et al. (2019). We visualize the distribution of the surface error on the 5000 training and test scans in Figure 19, as well as that of the baseline. As a reminder, the baseline is evaluated on the processed (cropped, subdivided) data.
As can be seen in Figure 19, SMF with vanilla PointNet has lower performance than the baseline, which used a vanilla PointNet trained on cropped scans. The distributions of the training error of SMF with and without attention are extremely close, with the no attention variant actually showing marginally lower error. As shown in Qi et al. (2017a), PointNet summarizes the input point cloud with a few (at most as many as the output dimension of the max pooling layer) points from the input. This property makes PointNet naturally robust to noise, to some extent. When looking at the generalization gap for the models, we can see the surface error increased less for SMF than for the model without attention. These observations suggest our changes all contribute to improved performance and improved generalization. We verify the contribution of the attention mechanism visually in Figure 20. We can see SMF without attention performs well, but reconstructions are noisier for the faceted scans from FRGC, and fewer details are present in the test 3DMD scan. Revisiting the examples of Figure 2, we can also see the attention mechanism helps discard sensor noise in Figure 21, and in row 3, column 2 of Figure 14, in which points inside the mouth also received attention scores close to 0.
We will further verify that the attention mechanism improves the quality of reconstructions on noisy out of sample scans in Section 6.4.
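The remark that PointNet summarizes a point cloud with at most as many points as the output dimension of its max pooling layer can be illustrated with a toy sketch. The random two-layer per-point network below is a hypothetical stand-in, not the encoder used in SMF; only the max pooling behaviour is the point of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, feat_dim = 4096, 64

# Hypothetical stand-in for the shared per-point MLP (random weights).
points = rng.normal(size=(n_points, 3))
W1 = rng.normal(size=(3, 32))
W2 = rng.normal(size=(32, feat_dim))
features = np.maximum(points @ W1, 0.0) @ W2   # (n_points, feat_dim)

# Symmetric max pooling over the points gives the global descriptor.
global_feature = features.max(axis=0)          # (feat_dim,)

# Each output channel is attained by a single point: the "critical" points.
critical_idx = np.unique(features.argmax(axis=0))

# Pooling over the critical points alone reproduces the descriptor exactly.
reduced = features[critical_idx].max(axis=0)
```

Here `critical_idx` contains at most `feat_dim` indices, so every other point could be perturbed (without exceeding the per-channel maxima) or removed without changing the pooled descriptor.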

Overview
In Section 5.1 and Section 5.2, we showed SMF (and SMF+) systematically outperforms the baseline on landmark localization error and offers performance competitive with NICP. Test set performance, in particular, was markedly higher than the current state of the art, and remained very close to the training set error. Direct evaluation of the median surface registration error offers a more complete picture of the registration quality and leads to similar conclusions. Visual inspection of the reconstructions confirms the quantitative analysis: contrary to the baseline, SMF provides noise-free registrations which closely match the raw scans in both identity and expression. We showed that re-sampling the scans typically leads to minor variations in their registrations in Section 5.3. Finally, we evaluated the contributions of our modifications of the PointNet encoder in Section 5.4.
The attention mechanism helps reduce noise and improve details on out of sample registrations.
Figure 2: the attention mechanism is able to discard noisy points in badly triangulated range scans.

A large scale hybrid morphable model
In this section, we assess the morphable model aspects of SMF. We first study the influence of the dimension of the identity and expression latent spaces on surface reconstruction error both in sample and out of sample. We then show SMF can be used to quickly generate realistic synthetic faces. In Section 6.3, we evaluate SMF on shape-to-shape translation applications, namely identity and expression transfer, and morphing. We conclude by showing SMF can be used successfully for registration and translation fully in the wild.

Dimension of the latent spaces
The classical linear morphable models literature typically reports three main metrics: specificity, compactness, and generalization. Specificity is evaluated in Section 6.2.1. Compactness is the proportion of the variance retained for increasing numbers of principal components, a direct correlate of the training error for PCA models. Generalization measures the reconstruction error on the test set for increasing numbers of principal components. Since our model is not linear, we instead report the training and test performance for increasing identity and expression dimensions. We choose symmetric decoders with $z_{id}$ and $z_{exp}$ of equal dimension $d$. We vary $d \in \{64, 128, 256, 512\}$. We measure the median (per scan) surface reconstruction error on the same subsets of 5000 training and 5000 test scans used in Section 5. We plot the median error across the 5000 scans along with its 95% confidence interval obtained by bootstrapping in Figure 22.
The specificity error is the mean distance of the sampled scans to their projection on the registered training set.
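A bootstrapped confidence interval for the median error across scans, as plotted in Figure 22, could be computed along the following lines. The text does not state which bootstrap variant was used, so the percentile bootstrap here is an assumption, as are the function name `bootstrap_median_ci`, the number of resamples, and the synthetic errors.

```python
import numpy as np

def bootstrap_median_ci(errors, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the median of `errors`.

    errors: 1D array with one (median surface) error value per scan.
    Returns (median, lower, upper) for a (1 - alpha) interval.
    """
    rng = np.random.default_rng(seed)
    errors = np.asarray(errors, dtype=float)
    n = errors.size
    # Resample the scans with replacement, n_boot times.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_medians = np.median(errors[idx], axis=1)
    lower, upper = np.quantile(boot_medians, [alpha / 2, 1 - alpha / 2])
    return np.median(errors), lower, upper

# Toy per-scan errors in mm (hypothetical values, for illustration only).
rng = np.random.default_rng(1)
toy_errors = rng.lognormal(mean=0.0, sigma=0.5, size=1000)
med, lo, hi = bootstrap_median_ci(toy_errors)
```

Repeating this for each latent dimension $d$ would yield one median and one interval per model, on the training and test subsets separately.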

Generating synthetic faces
We now evaluate the generative ability of our SMF+ model.

Specificity error
We follow the literature and measure the specificity error as follows. We sample 10,000 shapes at random from the joint latent space. Since our model is not explicitly trained as a generative model, no particular structure is to be expected on the latent space, and we therefore model the empirical distribution of the joint latent vectors of the training set with a multivariate Gaussian distribution. We estimate the empirical mean and covariance matrix of the ≈ 54,000 joint latent vectors and generate 10,000 Gaussian random vectors. We apply the pre-trained decoder to obtain generated faces. For each of the 10,000 random faces, we find its closest point in the training set, defined by the minimum (over all 54,000 training registrations) of the average (over the 29,495 points in the template) vertex-to-vertex Euclidean distance. The mean of these 10,000 distances is the specificity error of the model. For the sake of completeness, we repeated the experiment with the variants of SMF evaluated in Figure 22. We plot the specificity error and its 95% confidence interval computed by bootstrapping in Figure 23. Both SMF and SMF+ offer low specificity error, suggesting realistic-looking samples can be obtained. SMF+, in particular, has markedly lower specificity error than SMF for the same latent space dimensions, which confirms the benefits of training our very large scale model on the extended training set.
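The specificity protocol above can be sketched as follows. The `decoder` argument is a hypothetical callable mapping a joint latent vector to a `(n_vertices, 3)` array; the toy linear decoder, the sizes, and the function name `specificity_error` are assumptions chosen so the example runs quickly, and do not reflect the SMF decoder. Any post-processing of the sampled vectors (for instance renormalization) is not specified in the text and is omitted.

```python
import numpy as np

def specificity_error(train_latents, train_regs, decoder, n_samples, seed=0):
    """Mean distance of generated faces to their nearest training registration.

    train_latents: (N, D) joint latent vectors of the training set.
    train_regs:    (N, V, 3) registered training meshes (shared topology).
    decoder:       callable mapping a (D,) latent vector to a (V, 3) mesh.
    """
    rng = np.random.default_rng(seed)
    # Fit a multivariate Gaussian to the empirical latent distribution.
    mu = train_latents.mean(axis=0)
    cov = np.cov(train_latents, rowvar=False)
    samples = rng.multivariate_normal(mu, cov, size=n_samples)

    nearest = np.empty(n_samples)
    for i, z in enumerate(samples):
        face = decoder(z)
        # Average vertex-to-vertex distance to every training registration.
        d = np.linalg.norm(train_regs - face, axis=-1).mean(axis=1)
        nearest[i] = d.min()
    return nearest.mean()

# Toy setup (hypothetical): a random linear decoder on small meshes.
rng = np.random.default_rng(1)
n_train, latent_dim, n_vertices = 200, 8, 50
W = rng.normal(size=(n_vertices * 3, latent_dim))

def toy_decoder(z):
    return (W @ z).reshape(n_vertices, 3)

train_latents = rng.normal(size=(n_train, latent_dim))
train_regs = np.stack([toy_decoder(z) for z in train_latents])
err = specificity_error(train_latents, train_regs, toy_decoder, n_samples=100)
```

At the scale reported in the text (10,000 samples against about 54,000 registrations of 29,495 vertices), the inner distance computation would typically be batched or run on a GPU.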

Visualization of the samples
We now inspect a random subset of the 10,000 samples in Figure 24. We render each random sample, its closest point in the registered training set, and the raw scan from which the registration was obtained. We can see the samples generated by SMF+ are highly diverse and realistic-looking: they are close to the registrations of the training set without displaying mode collapse. SMF+ generates detailed faces with sharp features across a wide range of identity, age, ethnic background, and expression, including extreme face and mouth expressions. We further note the absence of artifacts and the seamless blending of the mouth with the rest of the face.
Figure 24: Samples from SMF+: First row (Sample): Generated face obtained by sampling a random joint vector. Second row (Closest reg.): Closest registration in the training set. Third row (Closest raw): Raw scan from which the closest registration was obtained.

Interpolation in the latent space
We now present a surface-to-surface translation experiment on the training set by showing the results of expression transfer and identity and expression interpolation in the latent spaces of SMF+. Since the latent vectors are hyperspherical, care must be taken to interpolate along the geodesics on the manifold. We therefore interpolate between two latent vectors $z_1$ and $z_2$, for $t \in [0, 1]$, along the geodesic joining them (Equation 23).
We select two expressive scans of two different subjects, referred to as S1 and S2, from two different databases (BU-3DFE and BU-4DFE) displaying distinct expressions (disgust and surprise). We study three cases: simultaneous interpolation of identity and expression, interpolation of identity for a fixed expression, and interpolation of expression for a fixed identity. We render points along the trajectory defined by Equation 23 at $t \in \{0, 0.25, 0.5, 0.75, 1\}$. The results of the interpolation are presented in Figure 25.
We observe smooth interpolation in all three cases. For simultaneous interpolation, we obtain a continuous morphing of the first expressive scan (t = 0) into the second expressive scan (t = 1). In particular, we note that the midpoint resembles what would be the neutral scan of a subject presenting physical traits of both the source (nose, forehead) and destination (eyes, jawline) subjects. The interpolation of the identity vector for the fixed expression of S1 shows a smooth transition towards S2 while keeping the correct expression. Conversely, interpolating the expression of S2 towards that of S1 for the fixed identity of S2 shows the overall identity remains recognizable and the expression displays a smooth evolution from surprise to disgust. These results show our model can be used for expression transfer and smooth interpolation on the training set. In Section 6.4, we evaluate SMF on surface-to-surface translation tasks in the wild.
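Interpolation along a geodesic of the unit hypersphere is commonly implemented as spherical linear interpolation (slerp). The sketch below uses the standard slerp formula; treating it as the form of the interpolation referred to as Equation 23 is an assumption, and the function name, the dimension of 256, and the random vectors are chosen for illustration only.

```python
import numpy as np

def slerp(z1, z2, t):
    """Spherical linear interpolation between two vectors, for t in [0, 1].

    Inputs are projected onto the unit sphere first. The geodesic is not
    unique for antipodal inputs, a case this sketch does not handle.
    """
    z1 = z1 / np.linalg.norm(z1)
    z2 = z2 / np.linalg.norm(z2)
    # Angle between the two unit vectors (clipped for numerical safety).
    omega = np.arccos(np.clip(np.dot(z1, z2), -1.0, 1.0))
    if omega < 1e-8:
        return z1.copy()  # (nearly) identical vectors: nothing to interpolate
    return (np.sin((1.0 - t) * omega) * z1 + np.sin(t * omega) * z2) / np.sin(omega)

# Toy latent vectors (hypothetical) and the five interpolation steps of the text.
rng = np.random.default_rng(0)
z1 = rng.normal(size=256)
z2 = rng.normal(size=256)
path = [slerp(z1, z2, t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
```

Every point on `path` has unit norm, so decoding it stays within the hyperspherical latent space. Interpolating only the identity vector or only the expression vector, with the other held fixed, gives the two partial interpolation cases.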

Face modeling and registration in the wild
We now evaluate SMF on the difficult tasks of registration and manipulation of scans found "in the wild", i.e. in uncontrolled environments, with arbitrary sensor types and acquisition noise. We collected the scans of three subjects, referred to as A, B, and C, in various conditions. For subject A (male, Caucasian), we obtained crops of two body scans, acquired more than a year and a half apart using two different body capture setups that produce meshes, in two different environments. The first scan shows a crop of the subject squatting while raising his right eyebrow, the second is of the subject jumping with a neutral face. We further acquired four high density point clouds of subject A performing different facial expressions: neutral, smiling (happy), surprise, and a "complex" compound expression consisting of raising the right eyebrow while opening and twisting the mouth to the left. Scanning was done in an uncontrolled environment using a commodity sensor, namely the embedded depth camera of an iPhone 11 Pro. Subject B (female, Caucasian) was captured posing with a light smile in a different uncontrolled environment, also with an iPhone 11 Pro, but using a lower resolution point cloud. Finally, subject C (male, Caucasian) was captured in a neutral pose using a state of the art light stage setup that outputs very high resolution meshes. All in all, the scans represent four different cameras, in five different environments, at five different levels of detail and surface quality, and across two different modalities (mesh and point cloud).
We use the pre-trained SMF model with and without attention to further extend the ablation study of Section 5.4. Scans were rigidly aligned with the cropped LSFM mean using landmarks. For meshes (body scans, light stage scan), we sample $2^{16}$ input points at random on the surface of the triangular mesh. For point clouds, we select $2^{16}$ points. Figure 26 shows the raw scans, registration from SMF, predicted attention mask for SMF, and registration for SMF trained without visual attention. We can see SMF produced very consistent registrations for subject A across modalities, resolution, and time: it is clear, from the registrations, that the scans came from the same subject, even for the low-resolution face and shoulders region of the first body scan, for which important facial features and the elevated position of the right eyebrow were captured. Comparing the neutral iPhone scan and the neutral body scan further shows identity was robustly captured at the two different resolutions. The highly non-linear complex expression was also accurately captured, and so were the more standard happy and surprise expressions. Performance was stable for lower-resolution raw point clouds too, as shown with the registration of subject B. SMF produced a sharp detailed registration of the high quality light-stage scan of subject C, correctly capturing the shape of the nose, the sharpness and inflexion of the eyebrows, and the angle of the mouth.
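Sampling points at random on the surface of a triangular mesh is typically done by picking triangles with probability proportional to their area and then drawing uniform barycentric coordinates inside each picked triangle. The sketch below follows that standard recipe; whether SMF samples in exactly this way is an assumption, and the function name `sample_surface` and the toy square mesh are illustrative.

```python
import numpy as np

def sample_surface(vertices, faces, n_points, seed=0):
    """Draw n_points uniformly at random on the surface of a triangle mesh.

    vertices: (n_vertices, 3) float array, faces: (n_faces, 3) int array.
    """
    rng = np.random.default_rng(seed)
    tri = vertices[faces]                                   # (F, 3, 3)
    # Triangle areas from the cross product of two edges.
    cross = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    areas = 0.5 * np.linalg.norm(cross, axis=1)
    # Pick triangles with probability proportional to their area.
    face_idx = rng.choice(len(faces), size=n_points, p=areas / areas.sum())
    # Uniform barycentric coordinates via the square-root trick.
    r1 = np.sqrt(rng.random(n_points))
    r2 = rng.random(n_points)
    u, v, w = 1.0 - r1, r1 * (1.0 - r2), r1 * r2
    picked = tri[face_idx]
    return u[:, None] * picked[:, 0] + v[:, None] * picked[:, 1] + w[:, None] * picked[:, 2]

# Toy mesh (hypothetical): a unit square in the z = 0 plane, two triangles.
square_vertices = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=float)
square_faces = np.array([[0, 1, 2], [0, 2, 3]])
cloud = sample_surface(square_vertices, square_faces, n_points=2**16)
```

Drawing a fresh point cloud with a different seed at every iteration is one way to realize the resampling that the stability experiment of Section 5.3 relies on.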
Compared to SMF, SMF trained without our attention mechanism still produced high quality registrations but with fewer details. The two body scans and the light stage scan show clear differences, especially in the eyes. The happy expression of subject B was not captured as accurately, and the shape of the face appears elongated. Looking at the attention masks, we can see our visual attention mechanism discarded points from the body, the inside of the mouth (A surprise), environment noise (C neutral), and hair and partial occlusions (B happy, for which it removed most of the glasses).
Morphing and editing in the wild. We now show our pretrained model can be used for shape morphing and editing, such as expression transfer, by linearly interpolating in $S^{255}$ between the predicted identity and expression vectors of the raw scans. We select the "A complex", "A surprise" and "C neutral" scans and register all three with our pretrained SMF model, keeping their predicted identity and expression embeddings. We first interpolate the identity and expression jointly between "A complex" and "C neutral" to produce a smooth morphing. We then keep the identity vector fixed to that of "C neutral" and linearly interpolate between the expression vectors of "C neutral" and "A surprise", which produces a smooth expression transfer. Both experiments are shown as a continuous transformation in Figure 27.

Figure 26: In the wild registrations with attention (SMF) and without attention (No att.): the scans of subject A were acquired over a period of two years using three different cameras (two different body capture stages and a commodity depth sensor in a smartphone). The scan of subject B was also acquired using a smartphone depth camera, but using a lower resolution setting. The scan of subject C is from a state of the art facial scanning light stage. SMF provides consistent high-quality registrations even from low-resolution scans comprising large areas of the body, hair, or background. In particular, the six scans of subject A show consistent representation of the identity. The attention mechanism can be seen to markedly improve details in the registrations.
Figure 27: Interpolation, transfer, and morphing in the wild: From A "complex" to C "neutral" to C "surprised" transferred from A. Panels, left to right: A (complex, native), t = 0.25, t = 0.5, t = 0.75, C (neutral, native), t = 0.25, t = 0.5, t = 0.75, C (surprised, transferred).
As is apparent in Figure 27, our model is able to smoothly interpolate between subjects and expressions of scans captured in the wild, across different modalities and resolutions. The morphing from A complex to C neutral produces smooth facial motions without discontinuities. Our model is further able not only to transfer expressions in the wild, but also to smoothly interpolate between expression vectors of different subjects for a fixed identity. The expression transfer again produces a smooth, natural-looking transition starting from the neutral scan of C, with the mouth and eyebrows smoothly moving from a resting position to a surprise expression, while keeping the facial features of subject C.

Conclusion and Future Work
In this paper, we present Shape My Face (SMF), a novel learning-based algorithm that treats the registration task as a surface-to-surface translation problem. Our model is based on an improved point cloud encoder made highly robust with a novel visual attention mechanism, and on our mesh inception decoders that leverage graph convolutions to learn a compact non-linear morphable model of the human face. We further improve robustness to noise in face scans by blending the output of the mesh convolutions with a specialized statistical model of the mouth in a seamless way. Our model learns to produce high quality registrations both in sample and out of sample, thanks to the improved weight sharing and stochastic training approach that prevent the model from overfitting any particular discretization of the training scans.
We introduce a large scale morphable model, coined SMF+, by training SMF on 9 comprehensive human 3D facial databases. Our experimental evaluation shows SMF+ can generate thousands of diverse realistic-looking faces from random noise across a wide range of ages, ethnicities, genders, and (extreme) facial expressions. We evaluate SMF+ on shape editing and translation tasks and show our model can be used for identity and expression transfer and interpolation. Finally, we show SMF can also accurately register and interpolate between facial scans captured in uncontrolled conditions for unseen subjects and sensors, allowing for shape editing entirely in the wild. In particular, we demonstrated smooth interpolation and transfer of expression and identity between a very high quality mesh acquired in controlled conditions with a sophisticated facial capture environment, and a noisy point cloud produced by consumer-grade electronics.
Future work will investigate improving the reproduction of high frequency details in the scans, and registering texture and geometry simultaneously.