3DFaceGAN: Adversarial Nets for 3D Face Representation, Generation, and Translation

Over the past few years, Generative Adversarial Networks (GANs) have garnered increased interest among researchers in Computer Vision, with applications including, but not limited to, image generation, translation, imputation, and super-resolution. Nevertheless, no GAN-based method has been proposed in the literature that can successfully represent, generate or translate 3D facial shapes (meshes). This can be primarily attributed to two facts, namely that (a) publicly available 3D face databases are scarce as well as limited in terms of sample size and variability (e.g., few subjects, little diversity in race and gender), and (b) mesh convolutions for deep networks present several challenges that are not entirely tackled in the literature, leading to operator approximations and model instability, often failing to preserve high-frequency components of the distribution. As a result, linear methods such as Principal Component Analysis (PCA) have been mainly utilized towards 3D shape analysis, despite being unable to capture non-linearities and high frequency details of the 3D face - such as eyelid and lip variations. In this work, we present 3DFaceGAN, the first GAN tailored towards modeling the distribution of 3D facial surfaces, while retaining the high frequency details of 3D face shapes. We conduct an extensive series of both qualitative and quantitative experiments, where the merits of 3DFaceGAN are clearly demonstrated against other, state-of-the-art methods in tasks such as 3D shape representation, generation, and translation.


Introduction
GANs are a promising unsupervised machine learning methodology implemented by a system of two deep neural networks competing against each other in a zero-sum game framework (Goodfellow et al. 2014). GANs quickly became popular owing to their unprecedented capability to implicitly model the distribution of visual data. They can generate and synthesize novel yet realistic images and videos that preserve high-frequency details of the data distribution and hence appear authentic to human observers. Many different GAN architectures have been proposed over the past few years, such as the Deep Convolutional GAN (DCGAN) (Radford et al. 2015) and the Progressive GAN (PGAN) (Karras et al. 2018), which was the first to show impressive results in the generation of high-resolution images.
A type of GAN that has also been extensively studied in the literature is the so-called Conditional GAN (CGAN) (Mirza and Osindero 2014), where the inputs of the generator as well as the discriminator are conditioned on the class labels. Applications of CGANs include domain transfer (Bousmalis et al. 2017; Tzeng et al. 2017), image completion (Yang et al. 2017; Wang et al. 2017), image super-resolution (Nguyen et al. 2018; Johnson et al. 2016; Ledig et al. 2017) and image translation (Zhu et al. 2017; Choi et al. 2017; Wang et al. 2018).
Despite the great success GANs have had in 2D image/video generation, representation, and translation, no GAN method tailored towards tackling the aforementioned tasks in 3D shapes has been introduced in the literature. This is primarily attributed to the lack of appropriate decoder networks for meshes that are able to retain the high frequency details (Dosovitskiy and Brox 2016; Jackson et al. 2017).
Fig. 1 A graphical representation of the data preprocessing step, which maps vertex coordinates (x, y, z) to (R, G, B) values. We begin by non-rigidly fitting a mesh template to the raw scan and then store the spatial information of the vertices (x, y, z) in a UV space. Lastly, a 2D nearest point interpolation is performed to fill in the missing values.

Fig. 2 Results of 3DFaceGAN in the shape translation task on test data of the proposed Hi-Lo database. The first row of shapes shows the low quality facial meshes captured by a low cost sensor, whereas the bottom row depicts the same subjects captured in high quality by an expensive high-end apparatus. The middle row shows our shape translation output when the network takes the low quality 3D facial scans as inputs.
In this paper, we study the task of representation, generation, and translation of 3D facial surfaces using GANs. Examples of the applications of 3DFaceGAN in the tasks of 3D face translation as well as 3D face representation and generation are presented in Fig. 2 and Fig. 3, respectively. Due to the fact that (a) the use of volumetric representation leads to very low-quality representation of faces (Fan et al. 2017;Qi et al. 2017), and (b) the current geometric deep learning approaches (Bronstein et al. 2017), and especially spectral convolution, preserve only the low-frequency details of the 3D faces, we study approaches that use 2D convolutions in a UV unwrapping of the 3D face. The process of unwrapping a 3D face in the UV domain is shown in Fig. 1. Overall, the contributions of this work can be summarized as follows.
- We introduce a novel autoencoder-like network architecture for GANs, which achieves state-of-the-art results in tasks such as 3D face representation, generation, and translation.
- We introduce a novel training framework for GANs, especially tailored for 3D facial data.
- We introduce a novel process for generating realistic 3D facial data, retaining the high frequency details of the 3D face.
The rest of the paper is structured as follows. In Section 2, we succinctly present the various methodologies that can be utilized in order to feed 3D facial data into a deep network and argue why the UV unwrapping of the 3D face was the method of choice. In Section 3, we present all the details of the 3DFaceGAN training process, losses, and model architectures. Finally, in Section 4, we provide information about the database we collected and the preprocessing we carried out on the databases we utilized for the experiments, and we present extensive quantitative and qualitative experiments comparing 3DFaceGAN against other state-of-the-art deep networks.

3D face representations for deep nets
Fig. 3 3D face representation and generation utilizing the proposed 3DFaceGAN. In (a) we demonstrate the 3D face representation capability of 3DFaceGAN. The first row shows the reconstructed 3D faces whereas the second row shows the corresponding real 3D faces. As evidenced, 3DFaceGAN is able to capture and reconstruct non-linear details of the 3D face such as lips, eyelids, etc. In (b) we present the generative nature of 3DFaceGAN. The left and right hand side show the real 3D face targets. The generated samples in between show the reconstructions and the interpolations of the targets in the latent space.

The most natural representation of a 3D face is through a 3D mesh. Adopting a 3D mesh representation requires application of mesh convolutions defined on non-Euclidean domains (i.e., geometric deep learning methodologies). Over the past few years, the field of geometric deep learning has received significant attention (Maron et al. 2017; Litany et al. 2017b; Lei et al. 2017). Methods relevant to this paper are auto-encoder structures such as those of Ranjan et al. (2018) and Litany et al. (2017a). Nevertheless, such auto-encoders, due to the type of convolutions applied, mainly preserve low-frequency details of the meshes. Furthermore, architectures that could potentially preserve high-frequency details, such as skip connections, have not yet been attempted in geometric deep learning. Therefore, geometric deep learning methods are not yet suitable for the problem we study in this paper.
Another way to work with 3D meshes is to concatenate the coordinates of the 3D points in a 1D vector and utilize fully connected layers to correctly decode the structure of the point cloud (Fan et al. 2017; Qi et al. 2017). Nevertheless, in this way the triangulation and spatial adjacency information is lost, and the number of parameters describing this formulation is extremely large, which makes the network hard to train.
Recently, many approaches aim at regressing directly on the latent parameters of a learned model space, e.g., PCA, rather than the 3D coordinates of points (Richardson et al. 2017;Tran et al. 2017;Dou et al. 2017;Genova et al. 2018). This formulation limits the geometrical details of the 3D representations and is restricted to their latent model space. In contrast, a 3D volumetric space is introduced in Jackson et al. (2017) as a representation of a 3D structure and exploits a Volumetric Regression Network which outputs a discretized version of the 3D structure. Due to discretization, the predicted 3D shape has low quality and corresponds to non-surface points that are difficult to handle.
Lastly, in Feng et al. (2018), a UV spatial map framework is utilized where the 3D coordinates of the points are stored in a UV space instead of the texture values of the mesh. This formulation exhibits a very good representation for 3D meshes where there are no overlapping regions and the mesh is optimally unwrapped. Since the 3D mesh is transferred in a 2D UV domain, we are then able to use 2D convolutions, with the whole range of capabilities they offer. As a result, this is our preferred methodology for preprocessing the 3D face scans, as further explained in Section 4.2.

Fig. 4 The 3DFaceGAN training process in a nutshell, comprising the pre-training of the discriminator (left) and the adversarial training of 3DFaceGAN (right). The networks receive 2D facial UVs as inputs and produce 2D facial UVs as outputs. The corresponding 3D faces are shown below or next to them. We first pre-train D (left figure). We then use the learned weights/biases to initialize D and G and subsequently start the adversarial training (right figure). The decoder parts of D and G are depicted in red, as we freeze their weight/bias updates during the training phase of 3DFaceGAN.

3DFaceGAN
In this Section we describe the training process, network architectures, and loss functions we utilized for 3DFaceGAN. Moreover, we discuss the framework we utilized for 3D face generation as well as present an extension of 3DFaceGAN which is able to handle data annotated with multiple labels.

Objective function
The main objective of the generator G is to receive a facial UV map x as input and generate a fake one, G(x), which in turn should be as close as possible to the real target facial UV map y. For example, in the case of 3D face translation, the input can be a neutral face and the output a certain expression (e.g., happiness), whereas in the case of 3D face reconstruction the input can be a 3D facial UV map and the output a reconstruction of that particular 3D facial UV map. The goal of the discriminator D is to distinguish between the real (y) and fake (G(x)) facial UV maps. Throughout the training process, D and G compete against each other until they reach an equilibrium, i.e., until D can no longer differentiate between the fake and the real facial UV maps.
Adversarial loss. To achieve the 3DFaceGAN objective, we propose to utilize the following loss for the adversarial part:
$$\mathcal{L}_D = \mathbb{E}_{y}\left[\mathcal{L}(y)\right] - \lambda_{adv} \cdot \mathbb{E}_{x}\left[\mathcal{L}(G(x))\right], \qquad \mathcal{L}_G = \mathbb{E}_{x}\left[\mathcal{L}(G(x))\right], \tag{1}$$
where $D(\cdot)$ refers to the output of the discriminator D, $\mathcal{L}(x) \doteq \|x - D(x)\|_1$, and $\lambda_{adv}$ is the hyper-parameter which controls how much weight should be put on $\mathcal{L}(G(x))$. The higher the $\lambda_{adv}$, the more emphasis D puts on the task of differentiating between the real and fake data. The lower the $\lambda_{adv}$, the more emphasis D puts on reconstructing the actual real data. Adjusting $\lambda_{adv}$ therefore sets a delicate balance between these two tasks. In our experiments we found that relatively low values of $\lambda_{adv}$ yield the best performance, as D is then able to influence the updates of G in such a way that the generated facial UV maps are more realistic. During the adversarial training, D tries to minimize $\mathcal{L}_D$ whereas G tries to minimize $\mathcal{L}_G$. Similar to recent works such as Zhao et al. (2016) and Berthelot et al. (2017), the discriminator D has the structure of an autoencoder. Nevertheless, the main differences are that (a) we do not make use of the margin m as in Zhao et al. (2016) or the equilibrium constraint as in Berthelot et al. (2017), and (b) we use the autoencoder structure of the discriminator and pre-train it with the real UV targets prior to the adversarial training. Further details about the training procedure are presented in Section 3.2.
Reconstruction loss. With the utilization of the adversarial loss (1), the generator G is trying to "fool" the discriminator D. Nevertheless, this does not guarantee that the fake facial UV will be close to the corresponding real, target one. To impose this, we use an L1 loss between the fake sample G(x) and the corresponding real one, y, so that they are as similar as possible, as in Isola et al. (2017). Namely, the reconstruction loss is the following:
$$\mathcal{L}_{rec} = \mathbb{E}_{x}\left[\|G(x) - y\|_1\right]. \tag{2}$$
Full objective. In sum, taking into account (1) and (2), the full objective becomes
$$\mathcal{L}_D = \mathbb{E}_{y}\left[\mathcal{L}(y)\right] - \lambda_{adv} \cdot \mathbb{E}_{x}\left[\mathcal{L}(G(x))\right], \qquad \mathcal{L}_G = \mathbb{E}_{x}\left[\mathcal{L}(G(x))\right] + \lambda_{rec} \cdot \mathcal{L}_{rec}, \tag{3}$$
where $\lambda_{rec}$ is the hyper-parameter that controls how much emphasis should be put on the reconstruction loss. Overall, the discriminator D tries to minimize $\mathcal{L}_D$ while the generator G tries to minimize $\mathcal{L}_G$.
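As an illustration only, the two losses can be sketched in a few lines of Python. The sketch assumes that the discriminator loss has the form L(y) − λ_adv · L(G(x)) and the generator loss the form L(G(x)) + λ_rec · ||G(x) − y||_1, which is how we read the description above. The names `recon_error` and `losses`, the flattened-list representation of a UV map, and the toy callables used for D and G are hypothetical and not part of any published implementation.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Net = Callable[[Sequence[float]], Vector]


def l1(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 distance between two flattened UV maps."""
    return sum(abs(p - q) for p, q in zip(a, b))


def recon_error(D: Net, x: Sequence[float]) -> float:
    """L(x) = ||x - D(x)||_1, the autoencoder-discriminator error."""
    return l1(x, D(x))


def losses(D: Net, G: Net, xs: List[Vector], ys: List[Vector],
           lambda_adv: float = 1e-3, lambda_rec: float = 1.0) -> Tuple[float, float]:
    """Return (loss_D, loss_G) averaged over a batch (assumed loss forms)."""
    n = len(xs)
    fakes = [G(x) for x in xs]
    real_term = sum(recon_error(D, y) for y in ys) / n
    fake_term = sum(recon_error(D, f) for f in fakes) / n
    rec_term = sum(l1(f, y) for f, y in zip(fakes, ys)) / n
    loss_d = real_term - lambda_adv * fake_term
    loss_g = fake_term + lambda_rec * rec_term
    return loss_d, loss_g


# Toy stand-ins for the networks (purely illustrative).
toy_D = lambda v: [0.5 * t for t in v]
toy_G = lambda v: [t + 0.1 for t in v]
loss_d, loss_g = losses(toy_D, toy_G, [[0.0, 1.0]], [[0.2, 1.0]])
```

With a small λ_adv, the discriminator loss is dominated by its reconstruction error on real data, which matches the behavior described in the text.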

Training procedure
In this Section, we first describe how we pre-train the discriminator (autoencoder) D and then provide details with respect to the adversarial training of 3DFaceGAN.
Pre-training the discriminator. The majority of GANs in the literature utilize discriminator architectures with logit outputs that correspond to a prediction on whether the input fed into the discriminator is real or fake. Recently proposed GAN variations have nevertheless taken a different approach, namely utilizing autoencoder structures as discriminators (Zhao et al. 2016; Berthelot et al. 2017). Using an autoencoder structure in the discriminator D is of paramount importance in the proposed 3DFaceGAN. The benefit is twofold: (a) we can pre-train the autoencoder D acting as discriminator prior to the adversarial training, which leads to better quantitative as well as more compelling visual results, and (b) we are able to compute the actual dense loss in the UV space, as opposed to simply deciding whether the input is real or fake. As we empirically show in our experiments and ablation studies, this approach encourages the generator to produce more realistic results than other, state-of-the-art methodologies.
Adversarial training. Before starting the adversarial training, we initialize the weights and biases of both the generator G and the discriminator D with the parameters learned during the pre-training of D (the architecture of G is identical to that of D). During the training phase of 3DFaceGAN, we freeze the parameter updates in the decoder parts of both the generator G and the discriminator D. Furthermore, we utilize a low learning rate on the encoder and bottleneck parts of G and D so that overall the parameters remain relatively close to the ones found during the pre-training of D.
Network architectures. The network architectures for both the discriminator D and the generator G are the same. In particular, each network consists of 2D convolutional blocks with a kernel size of three and a stride and padding size of one. Down-sampling is achieved by average 2D pooling with a kernel and stride size of two. The number of convolution filters grows linearly with each down-sampling step. Up-sampling is implemented by nearest-neighbor interpolation with a scale factor of two. The activation function that is primarily used is ELU (Clevert et al. 2015), apart from the last layer of both D and G where Tanh is utilized instead. At the bottleneck we utilize fully connected layers and thus project the tensors to a latent vector $b \in \mathbb{R}^{N_b}$. To generate more compelling visual results, we utilized skip connections (He et al. 2016; Huang et al. 2017) in the first layers of the decoder part of both the generator and the discriminator. Further details about the network architectures are provided in Table 1.
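To make the effect of these layer settings concrete, the following sketch traces tensor shapes through an encoder built from the stated hyper-parameters (kernel 3, stride 1, padding 1 for convolutions; kernel 2, stride 2 for average pooling; two convolutions per block). The number of levels is a hypothetical choice made for illustration, and the rule "level k has k·n filters" is only one reading of "grow linearly".

```python
from typing import List, Tuple


def conv_out(size: int, kernel: int = 3, stride: int = 1, padding: int = 1) -> int:
    """Spatial output size of a 2D convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size: int, kernel: int = 2, stride: int = 2) -> int:
    """Spatial output size of 2D average pooling along one axis."""
    return (size - kernel) // stride + 1


def encoder_shapes(size: int = 256, n: int = 128, levels: int = 5) -> List[Tuple[int, int, int]]:
    """Return (channels, height, width) after each encoder block.

    `levels` is an assumed depth; channels at level k are taken to be k * n.
    """
    shapes = []
    for level in range(1, levels + 1):
        size = conv_out(conv_out(size))  # two convolutions per block keep the size
        shapes.append((level * n, size, size))
        if level < levels:
            size = pool_out(size)        # pooling halves the resolution
    return shapes


shapes = encoder_shapes(size=256, n=128, levels=3)
```

Because a 3×3 convolution with stride and padding of one preserves the spatial size, only the pooling steps change the resolution, halving it at each level.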

3D face generation
Variational autoencoders (VAEs) (Kingma and Welling 2013) are widely used for generating new data using autoencoder-like structures. In this setting, VAEs add a constraint on the latent embeddings of the autoencoders that forces them to roughly follow a normal distribution. We can then generate new data by sampling a latent embedding from the normal distribution and passing it to the decoder. Nevertheless, it was empirically shown that enforcing the embeddings in the training process to follow a normal distribution leads to generators that are unable to capture high frequency details (Litany et al. 2017a). To alleviate this, we propose to generate data using Algorithm 1, which better retains the fidelity of the generated data, as shown in Section 4.
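The final steps of Algorithm 1 amount to fitting a multivariate Gaussian to a set of latent codes and sampling from it. A minimal standard-library sketch is given below. It assumes the latent codes Z have already been collected as a list of vectors, uses the unbiased (n − 1) covariance normalization, and samples through a Cholesky factor with a small jitter for numerical safety. All function names are hypothetical, and the sampled code would still have to be passed through the trained decoder, which is not modeled here.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]


def mean_and_cov(Z: Matrix) -> Tuple[List[float], Matrix]:
    """Mean of Z and covariance of the zero-mean Z (unbiased, assumed)."""
    n, d = len(Z), len(Z[0])
    mu = [sum(z[j] for z in Z) / n for j in range(d)]
    centered = [[z[j] - mu[j] for j in range(d)] for z in Z]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return mu, cov


def cholesky(A: Matrix, jitter: float = 1e-9) -> Matrix:
    """Lower-triangular L with L L^T approximately equal to A."""
    d = len(A)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(max(A[i][i] - s, 0.0) + jitter)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def sample_latent(mu: List[float], cov: Matrix, rng: random.Random) -> List[float]:
    """Draw z ~ N(mu, cov) as mu + L @ eps with eps ~ N(0, I)."""
    L = cholesky(cov)
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + sum(L[i][k] * eps[k] for k in range(i + 1))
            for i, m in enumerate(mu)]


Z = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]  # toy latent codes
mu, cov = mean_and_cov(Z)
z_new = sample_latent(mu, cov, random.Random(0))
```

Unlike a VAE, nothing here constrains the encoder during training; the Gaussian is fitted after the fact to whatever latent codes the trained network produces.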

3DFaceGAN for multi-label 3D data
Over the last few years, databases annotated with regard to multiple labels have become available in the scientific community. For instance, 4DFAB (Cheng et al. 2018) is a publicly available 3D facial database containing data annotated with respect to multiple expressions. We can extend 3DFaceGAN to handle data annotated with regard to multiple labels as follows. Without any loss of generality, suppose there are three labels in the database (e.g., expressions neutral, happiness and surprise). We adopt the so-called one-hot representation and thus denote the existence of a particular label in a datum by 1 and the absence by 0. For example, a 3D face datum annotated with the label happiness will have the following label representation: l = [0, 1, 0], where the first entry corresponds to the label neutral, the second to the label happiness and the third to the label surprise. We then choose the desired l we want to generate (e.g., if we want to translate a neutral face to a surprised one, we would choose l = [0, 0, 1]), spatially replicate it and concatenate it to the input that is then fed to the generator. The real target is the actual expression (in this case surprise) with the corresponding l spatially replicated and concatenated. Apart from this change, the rest of the training process is exactly the same as the one described in Section 3.2.
Table 1 Generator/Discriminator network architectures of 3DFaceGAN. As far as the notation is concerned, C denotes the number of input/output channels, K denotes the kernel size, S denotes the stride size, P denotes the padding size, AvgPool2D denotes average 2D pooling, UpNN denotes nearest-neighbor upsampling, and SF refers to the scaling factor size of the nearest-neighbor upsampling. CONV-BLOCK(C1, C2, K, S, P) and DECONV-BLOCK(C1, C2, K, S, P) refer to a block of two convolutions where the first is CONV(C1, C2, K, S, P) followed by an ELU (Clevert et al. 2015) activation function and the second is CONV(C2, C2, K, S, P), also followed by an ELU (Clevert et al. 2015) activation function.

[Table 1 columns: Part; Input → Output shape; Layer information]
Finally, to generate 3D facial data with respect to a particular label, we follow the same process as the one presented in Algorithm 1, with the only difference being that we extract different pairs of $(\mu_Z, \Sigma_Z)$ for every subset of the data, each corresponding to a particular label in the database. We then choose the pair $(\mu_Z, \Sigma_Z)$ corresponding to the desired label and sample from this multivariate Gaussian distribution.
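The label conditioning described in this section, i.e., spatially replicating a one-hot vector and concatenating it to the UV map, can be sketched as follows. The channel-first nested-list layout and the function names are assumptions made for illustration; an actual implementation would operate on tensors.

```python
from typing import List

Plane = List[List[float]]  # one H x W channel


def one_hot(index: int, num_labels: int) -> List[float]:
    """One-hot label vector, e.g. index 2 of 3 -> [0, 0, 1]."""
    return [1.0 if i == index else 0.0 for i in range(num_labels)]


def concat_label(uv_map: List[Plane], label: List[float]) -> List[Plane]:
    """Append one constant H x W plane per label entry to a C x H x W map."""
    h, w = len(uv_map[0]), len(uv_map[0][0])
    label_planes = [[[v] * w for _ in range(h)] for v in label]
    return list(uv_map) + label_planes


# Toy 3 x 2 x 2 UV map conditioned on the third label (e.g. surprise).
uv = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(3)]
conditioned = concat_label(uv, one_hot(2, 3))
```

With three labels, a 3-channel UV map becomes a 6-channel input, so the first convolution of the generator has to accept the extra channels.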

Experiments
In this Section we (a) describe the databases which we used to carry out the experiments utilizing 3DFaceGAN, (b) provide information with respect to the data preprocessing we conducted prior to feeding the 3D data into the network, (c) succinctly describe the baseline state-of-the-art algorithms we employed for comparisons and (d) provide quantitative as well as qualitative results on a series of experiments that demonstrate the superiority of 3DFaceGAN.

The Hi-Lo database
The Hi-Lo database contains approximately 6,000 3D facial scans captured during a special exhibition in the Science Museum, London. It is divided into the high quality data (Hi) recorded with a 3dMD face capturing system and the low quality data (Lo) captured with a V1 Kinect sensor. All the subjects were recorded in neutral expression. Approximately 3,000 subjects were recorded with both systems.
Algorithm 1: 3D face generation algorithm.
Step 4: Extract the mean $\mu_Z$ of Z and the covariance $\Sigma_Z$ of the zero-mean Z.
Step 5: To generate new data, retain only the trained Bottleneck 2 and the Decoder part of G (see Table 1 for the network structures) and sample a new $z_i$ (i.e., Bottleneck 2 input) from the multivariate Gaussian $\mathcal{N}(\mu_Z, \Sigma_Z)$.
The 3dMD apparatus utilizes a 4 camera structured light stereo system which can create 3D triangular surface meshes composed of approximately 60,000 vertices joined into approximately 120,000 triangles. The low quality data were captured with a KinectFusion framework (Newcombe et al. 2011). In contrast to the 3dMD system, multiple frames are required to build a single 3D representation of the subject's face. The fused meshes were built by employing a $608^3$ voxel grid. In order to accurately reconstruct the entire surface of the faces, a circular motion scanning pattern was carried out. Each subject was instructed to stay still in a fixed pose during the entire scanning process with a neutral facial expression. The frame rate for every subject was constant at 8 frames per second. Furthermore, all 3,000 subjects provided metadata about themselves, including their gender, age, and ethnicity. The database covers a wide variety of age, gender (48% male, 52% female), and ethnicity (82% White, 9% Asian, 5% Mixed Heritage, 3% Black and 1% other).
The Hi-Lo database was utilized for the experiments on 3D face representation and generation, where we used the high quality data to train 3DFaceGAN. Moreover, the Hi-Lo database was used to demonstrate the capabilities of 3DFaceGAN in a 3D face translation setting, where the low quality data are translated into high quality ones. In all of the training tasks, 85% of the data were used for training and the remaining 15% were used for testing.

4DFAB database
The 4DFAB database (Cheng et al. 2018) contains 3D facial data from 180 subjects (60 females, 120 males), aged from 5 to 75 years old. The subjects vary in their ethnic background, coming from more than 30 different ethnic groups. For the capturing process, the DI4D dynamic capturing system was used.
4DFAB (Cheng et al. 2018) contains data varying in expressions, such as neutral, happiness, and surprise. As a result, we utilized it to showcase 3DFaceGAN's capability in successfully handling data annotated with multiple labels in the task of 3D face translation as well as generation. In all of the training tasks, 85% of the data were used for training and the rest were used for testing.

Data preprocessing
In order to feed the 3D data into a deep network several steps need to be carried out. Since we employ various databases, the representation of the facial topology is not consistent in terms of vertex number and triangulation. To this end, we need to find a suitable template T that can easily retain the information of all raw scans across all databases and describe them with the same triangulation/topology. We utilized the mean face mesh of the LSFM model proposed by Booth et al. (2016), which consists of approximately 54,000 vertices that are sufficient to capture high frequency facial details. We then bring the raw scans into dense correspondence by non-rigidly morphing the template mesh to each one of them. For this task, we utilize an optimal-step Non-rigid Iterative Closest Point algorithm (De Smet and Van Gool 2010) in combination with a per-vertex weighting scheme. We weight the vertices according to the Euclidean distance measured from the tip of the nose. The greater the distance from the nose tip, the bigger the weight that is assigned to that vertex, i.e., the less freely it can deform. In that way we are able to avoid the noisy information recorded by the scanners on the outer regions of the raw scans.
Fig. 6 Qualitative results of 3DFaceGAN compared to CoMA (Ranjan et al. 2018) in the 3D representation task. Moreover, heatmaps are provided, visualizing the errors of both approaches against the ground truth test data. As evidenced, 3DFaceGAN is able to better capture the variation in the test data, especially in the eye and nose regions, where most of the non-linearities are present.
Following the analysis of the various methods of feeding 3D meshes into deep networks in Section 2, we chose to describe the 3D shapes in the UV domain. UV maps are usually utilized to store texture information. In our case, we store the spatial location of each vertex as an RGB value in the UV space. In order to acquire the UV pixel coordinates for each vertex, we start by unwrapping our mesh template T into a 2D flat space by utilizing an optimal cylindrical unwrapping technique proposed by Booth and Zafeiriou (2014). Before storing the 3D coordinates in the UV space, all meshes are aligned in 3D space by performing Generalized Procrustes Analysis (Gower 1975) and are normalized to the range [−1, 1]. Afterwards, we place each 3D vertex in the image plane at its respective UV pixel coordinate. Finally, after storing the original vertex coordinates, we perform a 2D nearest point interpolation in the UV domain to fill in the missing areas and thus produce a dense representation of the originally sparse UV map. Since the number of vertices in the template T is more than 50K, we choose a 256 × 256 × 3 tensor as the UV map size, which assists in retrieving a high precision point cloud with negligible resampling errors. A graphical representation of the preprocessing pipeline can be seen in Fig. 1.
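The last two preprocessing steps, storing vertex coordinates at their UV pixels and filling the remaining pixels by nearest point interpolation, can be illustrated with the toy sketch below. It assumes that the vertices are already normalized and that integer (row, column) pixel coordinates are given. The brute-force nearest search and the function names are hypothetical and only practical for small grids; a spatial index would be needed at 256 × 256.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
Grid = List[List[Optional[Point]]]


def rasterize_uv(vertices: Sequence[Point], pixels: Sequence[Tuple[int, int]],
                 size: int) -> Grid:
    """Store each (x, y, z) at its (row, col) pixel; other pixels stay None."""
    grid: Grid = [[None] * size for _ in range(size)]
    for xyz, (r, c) in zip(vertices, pixels):
        grid[r][c] = tuple(xyz)
    return grid


def fill_nearest(grid: Grid) -> Grid:
    """Fill every empty pixel with the value of the closest stored pixel."""
    size = len(grid)
    known = [(r, c) for r in range(size) for c in range(size)
             if grid[r][c] is not None]
    dense = [row[:] for row in grid]
    for r in range(size):
        for c in range(size):
            if dense[r][c] is None:
                nr, nc = min(known, key=lambda p: (p[0] - r) ** 2 + (p[1] - c) ** 2)
                dense[r][c] = grid[nr][nc]
    return dense


sparse = rasterize_uv([(0.1, 0.2, 0.3), (-1.0, 0.0, 1.0)], [(0, 0), (3, 3)], size=4)
dense = fill_nearest(sparse)
```

Each pixel of the dense map holds an (x, y, z) triple, so the map can be treated exactly like a three-channel image by 2D convolutions.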

Training
We trained all 3DFaceGAN models utilizing Adam (Kingma and Ba 2014) with $\beta_1 = 0.5$ and $\beta_2 = 0.999$. The discriminator was pre-trained with a batch size of 32 for a total of 300 epochs, and 3DFaceGAN was trained with a batch size of 16 for a total of 300 epochs. For our model we used n = 128 convolution filters and a bottleneck of size b = 128. The total number of trainable parameters was $38.5 \times 10^6$. The learning rate for both the pre-training and the training of the discriminator was 5e-5, and the same rate was used for the training of the generator. We linearly decayed the learning rate by 5% every 30 epochs during training. For the rest of the parameters, we used $\lambda_{adv}$ = 1e-3 and $\lambda_{rec}$ = 1. Overall training time on an NVIDIA GV100 GPU was about 5 days.
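The decay rule can be read in two ways, and the text does not settle which one was used. The sketch below implements both as an assumption: a linear reading, where 5% of the base rate is subtracted every 30 epochs, and a compounding reading, where the current rate is multiplied by 0.95 every 30 epochs. The function name and its flag are hypothetical.

```python
def learning_rate(epoch: int, base_lr: float = 5e-5, drop: float = 0.05,
                  every: int = 30, compound: bool = False) -> float:
    """Step-wise decayed learning rate under one of two assumed readings."""
    steps = epoch // every
    if compound:
        # Multiply by (1 - drop) at every step.
        return base_lr * (1.0 - drop) ** steps
    # Subtract drop * base_lr at every step, never going below zero.
    return base_lr * max(0.0, 1.0 - drop * steps)


lr_start = learning_rate(0)
lr_after_30 = learning_rate(30)
lr_last_epoch = learning_rate(299)
```

Over 300 epochs the two readings differ only modestly: the linear one ends at 55% of the base rate, the compounding one at roughly 63%.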

3D Face Representation
In the 3D face representation (reconstruction) experiments, we utilize the high quality 3D face data from the Hi-Lo database to train the algorithms. In particular, we feed the high quality 3D data as inputs to the models and use the same data as target outputs. Before providing the qualitative as well as quantitative results, we briefly describe the baseline models we compared against as well as provide information about the error metric we used for the quantitative assessment.

Baseline models
In this Section we briefly describe the state-of-the-art models we utilized to compare 3DFaceGAN against.

Vanilla Autoencoder (AE)
The Vanilla Autoencoder follows exactly the same structure as the discriminator we used in 3DFaceGAN. We used the same values for the hyper-parameters and the same optimization process. This is the main baseline we compared against and the results are provided in the ablation study in Section 4.4.3.

Convolutional Mesh Autoencoder (CoMA)
In order to train CoMA (Ranjan et al. 2018), we use the authors' publicly available implementation and utilize the default parameter values, the only difference being that the bottleneck size is 128, to make a fair comparison against 3DFaceGAN, where we also used a bottleneck size of 128.

Principal Component Analysis (PCA)
We employ and train a standard PCA model (Jolliffe 2011) on the meshes of the database we used for training. We aimed at retaining 98% of the variance of our available training data, which corresponds to the first 50 principal components.
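Choosing the number of components for a target variance is a cumulative sum over the eigenvalue spectrum. The sketch below shows the selection rule on toy eigenvalues; the function name and the numbers are illustrative and unrelated to the actual spectrum of the training data.

```python
from typing import Sequence


def components_for_variance(eigenvalues: Sequence[float], target: float = 0.98) -> int:
    """Smallest k whose top-k eigenvalues explain at least `target` of the variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return k
    return len(ordered)


# Toy spectrum: cumulative ratios are 0.50, 0.80, 0.95, 0.99, 1.00.
k = components_for_variance([50.0, 30.0, 15.0, 4.0, 1.0], target=0.98)
```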

Progressive GAN (PGAN)
In order to train PGAN (Karras et al. 2018), we used the authors' publicly available implementation with the default parameter values. After the training is complete, in order to represent a test 3D datum, we invert the generator G as in Lucic et al. (2018) and Mahendran and Vedaldi (2015), i.e., we solve $z^* = \arg\min_z \|x - G(z)\|$ by applying gradient descent on z while keeping G fixed (Mahendran and Vedaldi 2015).
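The inversion procedure can be illustrated on a toy problem where the gradient is available in closed form. The sketch below replaces the generator with a fixed linear map G(z) = Az and minimizes the squared L2 residual; both the linear stand-in and the choice of norm are assumptions made for illustration, since a real generator would require automatic differentiation.

```python
from typing import List

Matrix = List[List[float]]


def matvec(A: Matrix, z: List[float]) -> List[float]:
    """Toy generator: G(z) = A z."""
    return [sum(a * v for a, v in zip(row, z)) for row in A]


def invert_generator(A: Matrix, x: List[float], steps: int = 500,
                     lr: float = 0.05) -> List[float]:
    """Gradient descent on z for ||x - A z||^2 with A held fixed."""
    z = [0.0] * len(A[0])
    for _ in range(steps):
        residual = [xi - gi for xi, gi in zip(x, matvec(A, z))]
        # Gradient of the squared residual with respect to z is -2 A^T (x - A z).
        grad = [-2.0 * sum(A[i][j] * residual[i] for i in range(len(A)))
                for j in range(len(z))]
        z = [zj - lr * gj for zj, gj in zip(z, grad)]
    return z


A = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
x = matvec(A, [0.5, -0.25])          # target produced by a known latent code
z_star = invert_generator(A, x)
```

For a non-linear generator the objective is non-convex, so the recovered code depends on the initialization and the step size.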

Error metric
Fig. 8 Qualitative results of our approach compared to state-of-the-art baseline GAN methods in the 3D face translation task. The first column depicts the low quality input meshes whereas the second column depicts the high quality ground truth meshes. We depict the raw results of pix2pixHD and pix2pix along with their smoothed versions. As a smoothing technique we utilized a standard Laplacian smoothing operator.
A common practice when it comes to evaluating statistical shape models is to estimate the intrinsic characteristics, such as the generalization of the model (Davies et al. 2008). The generalization metric captures the ability of a model to represent unseen 3D face shapes during the testing phase. Table 2 presents the generalization metric for 3DFaceGAN compared against the baseline models. In order to compute the generalization error for a given model, we compute the per-vertex Euclidean distance between every sample of the test set and its corresponding reconstruction. We observe that the model which achieves the lowest error and thus demonstrates the greatest generalization capability is the proposed 3DFaceGAN, with mean error 0.0031 and standard deviation 0.0028. Additionally, as shown in Fig. 5a, which depicts the cumulative error distribution of the normalized dense vertex errors, 3DFaceGAN outperforms all of the baseline models.
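A minimal sketch of the generalization error follows. It assumes that test meshes and reconstructions share the same vertex ordering and that the mean and standard deviation are taken over all vertices of all test samples, which is one possible aggregation; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Mesh = Sequence[Point]


def per_vertex_errors(mesh: Mesh, recon: Mesh) -> List[float]:
    """Euclidean distance between corresponding vertices."""
    return [math.dist(p, q) for p, q in zip(mesh, recon)]


def generalization_error(test_meshes: Sequence[Mesh],
                         reconstructions: Sequence[Mesh]) -> Tuple[float, float]:
    """Mean and (population) standard deviation of all per-vertex errors."""
    errors = [e for m, r in zip(test_meshes, reconstructions)
              for e in per_vertex_errors(m, r)]
    mean = sum(errors) / len(errors)
    std = math.sqrt(sum((e - mean) ** 2 for e in errors) / len(errors))
    return mean, std


mesh = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
recon = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
mean_err, std_err = generalization_error([mesh], [recon])
```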

Ablation study
In this ablation study we investigate the importance of pre-training the discriminator D prior to the adversarial training of 3DFaceGAN as well as the freezing of the weights in the decoder parts of both D and G. More specifically, we compare 3DFaceGAN against the Vanilla Autoencoder (AE) and two further variations of 3DFaceGAN, namely (a) the simplest case, where the discriminator and generator structures are retained as is, but no pre-training takes place prior to the adversarial training (we refer to this methodology as 3DFaceGAN V2), and (b) the case where the discriminator and generator structures are retained as is and we pre-train the discriminator and initialize both the generator and the discriminator with the learned weights, but no parameters are frozen during the adversarial training (we refer to this methodology as 3DFaceGAN V3). As shown in Fig. 5b and Table 5, 3DFaceGAN outperforms Vanilla AE and 3DFaceGAN V2 by a large margin. Moreover, 3DFaceGAN also outperforms 3DFaceGAN V3. As a result, not only does 3DFaceGAN have the best performance among the compared variants, but it also requires less training time than 3DFaceGAN V3, as the parameters in the decoder parts of both the generator and the discriminator are not updated during the training phase and thus their updates need not be computed.

3D Face Translation
In the 3D face translation experiments, we utilize the low and high quality 3D face data from the Hi-Lo database to train the algorithms. In particular, we feed the low quality 3D data as inputs to the models and use the high quality data as target outputs. Before providing the qualitative as well as quantitative results, we briefly describe the baseline models we compared against as well as provide information about the error metric we used for the quantitative assessment.

Baseline models
In this Section we briefly describe the state-of-the-art deep models we utilized to compare 3DFaceGAN against.

Fig. 10 Reconstruction quality of our proposed GAN network along with pix2pixHD and pix2pix in the 3D face translation task. As can be seen, the mean error of 3DFaceGAN is considerably lower than that of the other two approaches.

Denoising Vanilla Autoencoder (Denoising AE)
Denoising Vanilla Autoencoder follows exactly the same structure as the Vanilla AE in Section 4.4, the only difference being the inputs fed to the network. This is the main baseline we compared against and the results are provided in the ablation study in Section 4.5.3.

Denoising Convolutional Mesh Autoencoder (Denoising CoMA)
Denoising CoMA (Ranjan et al. 2018) follows exactly the same structure as the Vanilla AE in Section 4.4, the only difference being again the inputs fed to the network.

pix2pix
pix2pix is amongst the most widely utilized GANs for image-to-image translation applications. We used the official implementation and hyper-parameter initializations provided by the authors.

pix2pixHD
More recently, pix2pixHD (Wang et al. 2018) was proposed, which can be considered an extension of pix2pix and which is able to better handle data of higher resolution. We used the official implementation and hyper-parameter initializations provided by the authors in Wang et al. (2018). As evinced in Fig. 8, Fig. 9, and Fig. 10, pix2pixHD outperforms pix2pix, and this is expected since pix2pixHD uses more intricate structures for both the generator and discriminator networks.

Error metric
For each low quality test mesh we aim to estimate the high quality representation based on the 3dMD ground truth data. The error metric between the estimated and the real high quality mesh is a standard 3D Root Mean Square Error (3DRMSE), where the Euclidean distances are computed between the two meshes and normalized based on the interocular distance of the test mesh. Before computing the error metric we perform dense alignment between each test mesh and its corresponding ground truth by implementing an iterative closest point (ICP) algorithm (Besl and McKay 1992). In order to avoid any inconsistencies in the alignment we compute a point-to-plane rather than a point-to-point error. Finally, the measurements are performed in the inner part of the face, where we crop each test mesh at a radius of 150 mm around the tip of the nose. As can be clearly seen in Fig. 9a as well as in Table 4, 3DFaceGAN outperforms all of the compared state-of-the-art methods.
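A minimal sketch of this error metric is given below. It assumes that the ICP alignment has already been performed and that each estimated vertex has been paired with a ground-truth vertex and its surface normal; the function name and its arguments are hypothetical, and the correspondence search itself is omitted.

```python
import numpy as np

def normalized_point_to_plane_rmse(estimated, ground_truth, gt_normals,
                                   nose_tip, interocular, radius=150.0):
    """Point-to-plane RMSE over the inner face, normalized by interocular distance.

    estimated, ground_truth, gt_normals: arrays of shape (n_vertices, 3),
    already aligned and in one-to-one correspondence (assumption).
    nose_tip: array of shape (3,); radius and coordinates are in mm.
    """
    # Keep only the vertices within `radius` of the nose tip.
    keep = np.linalg.norm(estimated - nose_tip, axis=1) <= radius
    if not keep.any():
        raise ValueError("no vertices fall inside the crop radius")
    # Project the residual onto the unit normal of the ground-truth surface.
    unit_normals = gt_normals / np.linalg.norm(gt_normals, axis=1, keepdims=True)
    signed = np.einsum("ij,ij->i", estimated - ground_truth, unit_normals)
    distances = np.abs(signed[keep]) / interocular
    return float(np.sqrt(np.mean(distances ** 2)))
```

Projecting onto the normal discards displacement that slides along the surface, which is why a point-to-plane error is less sensitive to small tangential misalignments than a point-to-point error.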

Ablation study
For the ablation study in this set of experiments, we use exactly the same 3DFaceGAN variants as the ones we utilized in Section 4.4.3. Moreover, instead of the vanilla AE in this experiment we utilize the denoising AE. As evinced in Fig. 9b and Table 5, 3DFaceGAN clearly outperforms all of the compared models.

Multi-label 3D Face Translation
In this experiment we utilize 4DFAB (Cheng et al. 2018) for the multi-label transfer of expressions. In particular, we feed the neutral faces to the models and receive as outputs either the ones bearing the label happiness or surprise. It should be noted here that whereas 3DFaceGAN requires only a single model to be trained under the multi-label expression translation scenario, the rest of the compared models require a separately trained model for each label, i.e., a model for expression happiness and a model for expression surprise. As baseline models for comparisons, we use exactly the same as the ones in Section 4.5, the only difference being the inputs fed to the networks as well as the corresponding targets. Qualitative comparisons against the compared methods are presented in Fig. 11.

Fig. 11 Qualitative results of our approach compared to state-of-the-art baseline GAN methods in the multi-label 3D face translation task in various expressions (e.g., happiness, surprise) trained with the 4DFAB (Cheng et al. 2018) database. The first column depicts the neutral input mesh whereas the rest of the columns represent the translated meshes of the respective state-of-the-art methods compared to our approach. As can be seen, 3DFaceGAN is able to retain the high-frequency details to a greater degree than CoMA (Ranjan et al. 2018), the second best method, which produces more smoothed outputs.

3D Face Generation
In the 3D face generation experiment, we utilized the high quality data of the Hi-Lo database to train the algorithms. In particular, we feed the high quality 3D data as inputs to the models and use the same data as target outputs.

Baseline models
The baseline models we used in this set of experiments are the same as the ones presented in Section 4.4.

Error metric
The metric of choice to quantitatively assess the performance of the models in this set of experiments is specificity (Brunton et al. 2014). For a randomly generated 3D face, the specificity metric measures the distance of this 3D face to its nearest real 3D face belonging to the test set, in terms of the minimum per-vertex distance over all samples of the test set. To evaluate this metric, we randomly generate n = 10,000 face meshes from each model. Table 6 reports the specificity metric for 3DFaceGAN compared against the baseline models. In order to generate random meshes utilizing 3DFaceGAN, we sample from a multivariate Gaussian distribution, as explained in Section 3.3. To generate random meshes utilizing PGAN (Karras et al. 2018), we sample new latent embeddings from the multivariate normal distribution and feed them to the generator G. To generate random faces utilizing CoMA (Ranjan et al. 2018), we utilize the proposed variational convolutional mesh autoencoder structure, as described in (Ranjan et al. 2018). For the PCA model (Jolliffe 2011), we generate meshes directly from the latent eigenspace by drawing random samples from a Gaussian distribution defined by the principal eigenvalues. As shown in Table 6, 3DFaceGAN achieves the best specificity error, outperforming all compared methods by a large margin. In Fig. 7, we present various visualizations of realistic 3D faces generated by 3DFaceGAN. As can be clearly seen, 3DFaceGAN is able to generate data varying in ethnicity, age, etc., thus capturing the whole population spectrum.

Table 6 Specificity metric on the test set for the 3D face generation task. We generate 10,000 random faces from each model. The table reports the mean error (Mean) and the standard deviation (std).
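The specificity computation, together with the sampling procedure used for the PCA baseline, can be sketched as follows. This is an illustration under stated assumptions rather than our exact evaluation code: the function names are hypothetical, the distance between two faces is taken to be the mean per-vertex Euclidean distance, and the PCA components are assumed to be stored as orthonormal rows.

```python
import numpy as np

def sample_pca_faces(mean_shape, components, eigenvalues, n_samples, rng):
    """Draw faces from a Gaussian in the PCA eigenspace.

    mean_shape: (3 * n_vertices,), components: (k, 3 * n_vertices),
    eigenvalues: (k,) variances along each principal component.
    """
    coeffs = rng.standard_normal((n_samples, eigenvalues.size)) * np.sqrt(eigenvalues)
    flat = mean_shape + coeffs @ components
    return flat.reshape(n_samples, -1, 3)

def specificity(generated, test_set):
    """Mean and std of the distance from each generated face to its nearest test face.

    generated: (n_generated, n_vertices, 3); test_set: (n_test, n_vertices, 3).
    """
    nearest = np.empty(len(generated))
    for i, face in enumerate(generated):
        # Mean per-vertex distance from this face to every test face.
        dists = np.linalg.norm(test_set - face, axis=-1).mean(axis=1)
        nearest[i] = dists.min()
    return float(nearest.mean()), float(nearest.std())
```

A call such as `specificity(sample_pca_faces(mu, U, lam, 10000, np.random.default_rng(0)), test_set)` would then produce the two quantities of the kind reported per method in Table 6; lower values indicate that generated faces lie closer to real ones.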

Multi-label 3D Face Generation
In this set of experiments, we utilized the 4DFAB (Cheng et al. 2018) data to generate random subjects of various expressions such as happiness and surprise, as seen in Fig. 12. The 3D faces were generated utilizing the methodology detailed in Section 3.4. As evinced, 3DFaceGAN is able to generate expressions of subjects varying in age and ethnicity, while retaining the high-frequency details of the 3D face.

Conclusion
In this paper we presented the first GAN tailored for the tasks of 3D face representation, generation, and translation. Leveraging the strengths of autoencoder-based discriminators in an adversarial framework, we proposed 3DFaceGAN, a novel technique for training on large-scale 3D facial scans. As shown in an extensive series of quantitative as well as qualitative experiments against other state-of-the-art deep networks, 3DFaceGAN improves upon state-of-the-art algorithms for the tasks at hand by a significant margin.