SliderGAN: Synthesizing Expressive Face Images by Sliding 3D Blendshape Parameters

Image-to-image (i2i) translation is the dense regression problem of learning how to transform an input image into an output using aligned image pairs. Remarkable progress has been made in i2i translation with the advent of Deep Convolutional Neural Networks (DCNNs), and in particular with the learning paradigm of Generative Adversarial Networks (GANs). In the absence of paired images, i2i translation is tackled with one or multiple domain transformations (e.g., CycleGAN, StarGAN). In this paper, we study a new problem, that of image-to-image translation under a set of continuous parameters that correspond to a model describing a physical process. In particular, we propose SliderGAN, which transforms an input face image into a new one according to the continuous values of a statistical blendshape model of facial motion. We show that it is possible to edit a facial image according to expression and speech blendshapes, using sliders that control the continuous values of the blendshape model. This provides much more flexibility in various tasks, including but not limited to face editing, expression transfer and face neutralisation, compared to models based on discrete expressions or action units.


Introduction
Interactive editing of the expression of a face in an image has countless applications, including but not limited to movie post-production, computational photography and face recognition (e.g. expression neutralisation). In computer graphics, facial motion editing is a popular field; nevertheless, it mainly revolves around constructing person-specific models that require a lot of training samples [27]. Recently, the advent of machine learning, and especially Deep Convolutional Neural Networks (DCNNs), has provided very exciting tools that lead the community to re-think the problem. In particular, recent advances in Generative Adversarial Networks (GANs) provide very exciting solutions for image-to-image (i2i) translation.
i2i translation, i.e. the problem of learning how to transform aligned image pairs, has attracted a lot of attention during the last few years [18,35,12]. The so-called pix2pix model and its alternatives demonstrated excellent results in tasks such as image completion [18]. In order to perform i2i translation in the absence of image pairs, the so-called CycleGAN was proposed, which introduced a cycle-consistency loss [35]. CycleGAN could perform i2i translation between two domains only (i.e. in the presence of two discrete labels). The more recent StarGAN [12] extended this idea further to accommodate multiple domains (i.e. multiple discrete labels).
StarGAN can be used to transfer an expression to a given facial image by providing the discrete label of the target expression. Hence, it has quite limited capabilities in expression editing and arbitrary expression transfer. Over the past year, quite a few deep learning methodologies have been proposed for transforming facial images [12,33,25]. The work most closely related to ours is the recent work of [25], which proposed the GANimation model. GANimation follows the same line of research as StarGAN to translate facial images according to the activation of certain facial Action Units (AUs) and their intensities. Even though AU coding is a quite comprehensive model for describing facial motion, detecting AUs is currently an open problem both in controlled and in unconstrained recording conditions [7,6].
In particular, in unconstrained conditions the detection accuracy for certain AUs is not yet high enough [7,6], which affects the generation accuracy of GANimation. One of the reasons for the low accuracy of automatic AU annotation is the lack of annotated data and the high cost of annotation, which has to be performed by highly trained experts. Finally, even though AUs 10-28 model mouth and lip motion, only 10 of them can be automatically recognized (10, 12, 14, 15, 17, 20, 23, 25, 26, 28), and only with low accuracy; thus, they cannot describe all possible lip motion patterns produced during speech. Hence, the GANimation model cannot be used in a straightforward manner for transferring speech.
In this paper, we are motivated by the recent successes of methodologies for 3D face reconstruction from in-the-wild images [26,28,29,9,8], which make use of a statistical model of 3D facial motion by means of a set of linear blendshapes, and we propose a methodology for facial image translation using GANs driven by the continuous parameters of the linear blendshapes. The linear blendshapes can describe both the motion that is produced by expression [11] and the motion that is produced by speech [30]. On the contrary, neither discrete emotions nor facial action units can be used to describe the motion produced by speech or the combination of motion from speech and expression. We demonstrate that it is possible to transform a facial image along the continuous axis of individual expression and speech blendshapes.
Moreover, contrary to StarGAN, which uses discrete labels regarding expression, and GANimation, which utilizes annotations with regard to action units, our methodology does not need any human annotations, as we operate using pseudo-annotations provided by fitting a 3D Morphable Model (3DMM) to images [9] (for expression deformations) or by aligning audio signals [30] (for speech deformations). Building on the automatic annotation process exploited by SliderGAN, a by-product of our training process is a very robust regression DCNN that estimates the blendshape parameters directly from images. This DCNN is extremely useful for expression and/or speech transfer, as it can automatically estimate the blendshape parameters of target images.
i2i translation models have achieved photo-realistic results by utilizing different GAN optimization methods from the literature. pix2pix employed the original GAN optimization technique proposed in [16]. However, the loss function of GAN may lead to the vanishing gradients problem during the learning process. Hence, more effective GAN frameworks emerged that were employed by i2i translation methods. CycleGAN uses LSGAN, which builds upon GAN by adopting a least squares loss function for the discriminator. StarGAN and GANimation use WGAN-GP [17], which enforces a gradient penalty as a measure to regularize the discriminator. WGAN-GP builds upon WGAN [3], which minimizes an approximation of the Wasserstein distance to stabilize the training of GANs.
A recent approach to efficient GAN optimization, which has been used to produce higher quality textures [32], is the Relativistic GAN (RGAN) [19]. RGAN was suggested in order to train the discriminator to simultaneously decrease the probability that real images are real, while increasing the probability that the generated images are real. In our work, we incorporate RGAN in the training process of SliderGAN and demonstrate that it improves the generator, which produces more detailed results in the task of i2i translation for expression and speech synthesis when compared to training with WGAN-GP. In particular, we employ the Relativistic average GAN (RaGAN), which decides whether an image is relatively more realistic than the others on average, rather than whether it is real or fake. More details, as well as the benefits of this mechanism, are presented in Section 3.1.
To summarize, the proposed method includes quite a few novelties. First of all, we showcase that SliderGAN is able to synthesize smooth deformations of expression and speech in images by utilizing 3D blendshape models of expression and speech respectively. Moreover, to the best of our knowledge, it is the first time that a direct comparison of blendshape and AU coding is presented for the task of expression and speech synthesis. In addition, our approach is annotation-free but offers much better accuracy than AU-based methods. Furthermore, it is the first time that the Relativistic GAN has been employed for the task of expression and speech synthesis. We demonstrate in our results that SliderGAN trained with the RaGAN framework (SliderGAN-RaD) produces more detailed textures than when trained with the standard WGAN-GP framework (SliderGAN-WGP). Finally, we enhance the training of our model with synthesized data, leveraging the reconstruction capabilities of statistical shape models.

Expression Blendshape Models
Blendshape models are frequently used in computer vision tasks as they constitute an effective parametric approach to modelling facial motion. The localized blendshape model [23] proposed a method to localize sparse deformation modes with intuitive visual interpretation. The model was built from sequences of manually collected expressive 3D face meshes. In more detail, a variant of sparse Principal Component Analysis (PCA) was applied to a matrix $D = [d_1, ..., d_m] \in \mathbb{R}^{3n \times m}$, which includes $m$ difference vectors $d_i \in \mathbb{R}^{3n}$, produced by subtracting each expressive mesh from the neutral mesh of each corresponding sequence. Therefore, the sparse blendshape components $C \in \mathbb{R}^{h \times m}$ were recovered by the following minimization problem:

$$\arg\min \|D - BC\|_F^2 + \Omega(C) \quad \text{s.t.} \quad V(B), \tag{1}$$

where the constraint $V$ can either be $\max(|B_k|) = 1, \forall k$ or $\max(B_k) = 1, B \geq 0, \forall k$. According to [23], the selection of the constraints mainly controls whether face deformations will take place towards both the negative and positive directions of the axes of the model's parameters or not, which is useful for describing shapes like muscle bulges. The regularization $\Omega$ of the sparse components $C$ was performed with the $\ell_1/\ell_2$ norm [34,4], while an iterative alternating optimization was employed to compute the optimal $C$ and $B$. The exact same approach was employed by [11] in the construction of the 4DFAB blendshape model exploited in this work. The 5 most significant deformation components of the 4DFAB expression model are depicted in Fig. 2.

Extraction of expression parameters by 3DMM fitting
3DMM fitting for 3D reconstruction of faces consists of optimizing three parametric models, the shape, texture and camera models, in order to render a 2D instance as close as possible to the input image. To extract the expression parameters from an image we employ 3DMM fitting, and particularly the approach proposed in [9].
In our pipeline we employ the identity variation of LSFM [10], which was learned from 10,000 face scans of unique identity, as the shape model to be optimized. To incorporate expression variation in the shape model, we combine LSFM with the 4DFAB blendshape model [11], which was learned from 10,000 face scans of spontaneous and posed expression. The complete shape model can then be expressed as:

$$S(p_{id}, p_{exp}) = \bar{s} + U_{s,id}\, p_{id} + U_{s,exp}\, p_{exp}, \tag{2}$$

where $\bar{s}$ is the mean component of 3D shape, $U_{s,id}$ and $U_{s,exp}$ are the identity and expression subspaces of LSFM and 4DFAB respectively, and $p_{id}$ and $p_{exp}$ are the identity and expression parameters which are used to determine 3D shape instances. Therefore, by fitting the 3DMM of [9] to an input image $I$, we can extract identity and expression parameters $p_{id}$ and $p_{exp}$ that instantiate the recovered 3D face mesh $S(p_{id}, p_{exp})$. Based on the independent shape parameters for identity and expression, we exploit the parameters $p_{exp}$ to compose an annotated dataset of images and their corresponding vectors of expression parameters $\{I^i, p^i_{exp}\}_{i=1}^{K}$, with no manual annotation cost.
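As a hedged illustration of the linear shape model described above, the following dependency-free Python sketch instantiates a shape as the mean plus identity and expression offsets. The function name `instantiate_shape` and the toy one-vertex model are assumptions made for illustration; they are not part of LSFM, 4DFAB or the fitting method of [9].

```python
def instantiate_shape(mean, U_id, U_exp, p_id, p_exp):
    """Return mean + U_id @ p_id + U_exp @ p_exp using plain Python lists.

    mean: list of 3n floats; U_id, U_exp: lists of 3n rows (one column per
    component); p_id, p_exp: lists of parameter values.
    """
    def matvec(U, p):
        # Matrix-vector product, one output value per row of U.
        return [sum(u * w for u, w in zip(row, p)) for row in U]

    id_offset = matvec(U_id, p_id)
    exp_offset = matvec(U_exp, p_exp)
    return [m + a + b for m, a, b in zip(mean, id_offset, exp_offset)]


# Toy model (hypothetical): one vertex, one identity and one expression component.
mean = [0.0, 0.0, 0.0]
U_id = [[1.0], [0.0], [0.0]]
U_exp = [[0.0], [1.0], [0.0]]
shape = instantiate_shape(mean, U_id, U_exp, p_id=[2.0], p_exp=[-0.5])
```

Because identity and expression enter as separate additive terms, the expression parameters can be read off and reused independently of the identity parameters, which is what the automatic annotation relies on.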

Proposed Methodology
In this section we develop the proposed methodology for continuous facial expression editing based on sliding the parameters of a 3D blendshape model.

Slider-based Generative Adversarial Network for continuous facial expression and speech editing
Problem Definition Let us first formulate the problem under analysis and then describe our proposed approach to address it. We define an input image $I_{org} \in \mathbb{R}^{H \times W \times 3}$ which depicts a human face of arbitrary expression. We further assume that any facial deformation or grimace evident in image $I_{org}$ can be encoded by a parameter vector $p_{org} = [p_{org,1}, p_{org,2}, ..., p_{org,N}]^\top$ of $N$ continuous scalar values $p_{org,i}$, normalized in the range [−1, 1]. In addition, the same vector $p_{org}$ constitutes the parameters of a linear 3D blendshape model $S_{exp}$ that, as in Fig. 3, instantiates the 3D representation of the facial deformation of image $I_{org}$, which is given by the expression:

$$S_{exp}(p_{org}) = \bar{s} + U_{exp}\, p_{org}, \tag{3}$$

where $\bar{s}$ is a mean 3D face component and $U_{exp}$ the expression eigenbasis of the 3D blendshape model. Our goal is to develop a generative model which, given an input image $I_{org}$ and a target expression parameter vector $p_{trg}$, will be able to generate a new version $I_{gen}$ of the input image with the simulated expression given by the 3D expression instance $S_{exp}(p_{trg})$.
Attention-Based Generator To address the challenging problem described above, we propose to employ a Generative Adversarial Network architecture in order to train a generator network $G$ that performs translation of an input image $I_{org}$, conditioned on a vector of 3D blendshape parameters $p_{trg}$, thus learning the generator mapping $G(I_{org} | p_{trg}) \rightarrow I_{gen}$. In addition, to better preserve the content and the colour of the original images, we employ an attention mechanism at the output of the generator as in [1,25]. That is, we employ a generator with two parallel output layers, one producing a smooth deformation mask $G_m \in \mathbb{R}^{H \times W}$ and the other a deformation image $G_i \in \mathbb{R}^{H \times W \times 3}$. The values of $G_m$ are restricted to the range [0, 1] by enforcing a sigmoid activation. Then, $G_m$ and $G_i$ are combined with the original image $I_{org}$ to produce the target expression $I_{gen}$ as:

$$I_{gen} = G_m \odot G_i + (1 - G_m) \odot I_{org}. \tag{4}$$

Relativistic Discriminator We employ a discriminator network $D$ that forces the generator $G$ to produce realistic images of the desired deformation. Different from the standard discriminator in GANimation, which estimates the probability of an image being real, we employ the Relativistic Discriminator [19], which estimates the probability of an image being relatively more realistic than a generated one. That is, if $D_{img}(I) = \sigma(C(I))$ is the activation of the standard discriminator, then $D_{RaD,img} = \sigma(C(I_{org}) - C(I_{gen}))$ is the activation of the Relativistic Discriminator. Particularly, we employ the Relativistic average Discriminator (RaD), which accounts for all the real and generated data in a mini-batch. Then, the activation of the RaD is:

$$D_{RaD,img}(I) = \begin{cases} \sigma\big(C(I) - E_{I_{gen}} C(I_{gen})\big), & \text{if } I \text{ is a real image}, \\ \sigma\big(C(I) - E_{I_{org}} C(I_{org})\big), & \text{if } I \text{ is a generated image}, \end{cases} \tag{5}$$

where $E_{I_{org}}$ and $E_{I_{gen}}$ define the average activations of all real and generated images in a mini-batch respectively.
We further extend $D$ by adding a regression layer parallel to $D_{img}$ that estimates a parameter vector $p_{est}$, to encourage the generator to produce accurate facial expressions, $D(I) \rightarrow D_p(I) = p_{est}$. Finally, we aim to boost the ability of $G$ to maintain face identity between the original and the generated images by incorporating a face recognition module $F$.
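The two mechanisms described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: a mask value of 1 is assumed to select the deformation image and 0 to keep the original pixel, images are flattened to lists of floats, and the critic outputs $C(\cdot)$ are given as lists of scalars. The function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def compose_output(mask, deform, original):
    """Blend per pixel: mask * deform + (1 - mask) * original.

    Assumption: mask = 1 takes the generated (deformation) pixel,
    mask = 0 keeps the original pixel.
    """
    return [m * d + (1.0 - m) * o for m, d, o in zip(mask, deform, original)]


def rad_activations(c_real, c_fake):
    """Relativistic average activations for one mini-batch.

    Each real critic output is compared with the mean critic output of the
    generated images, and each generated one with the mean of the real images.
    """
    mean_real = sum(c_real) / len(c_real)
    mean_fake = sum(c_fake) / len(c_fake)
    d_real = [sigmoid(c - mean_fake) for c in c_real]
    d_fake = [sigmoid(c - mean_real) for c in c_fake]
    return d_real, d_fake


blended = compose_output([0.0, 1.0], [5.0, 5.0], [1.0, 1.0])
d_real, d_fake = rad_activations([1.0, 3.0], [2.0, 2.0])
```

With this convention, a mask that is zero outside the deforming regions leaves the rest of the input image untouched, which is how the attention mechanism preserves content and colour.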

Semi-supervised training
We train our model in a semi-supervised manner, using both data with no image pairs of the same person under different expressions $\{I^i_{org}, p^i_{org}, p^i_{trg}\}_{i=1}^{K}$ and data with image pairs that we automatically generate as described in detail in Section 4.1. The modules of our model, as well as the training process of SliderGAN, are presented in Fig. 4.

Adversarial Loss
To improve the photorealism of our synthesized images we utilize the Wasserstein GAN adversarial objective with gradient penalty (WGAN-GP) [17]. The selected WGAN-GP adversarial objective with RaD is given by Eq. (6). Different from the standard discriminator, both real and generated images are included in the generator part of the objective of Eq. (6). This allows the generator to benefit from the gradients of both real and fake images, which, as we show in the experimental section, leads to generated images with sharper edges and more details, which also better represent the distribution of the real data.
Based on the original GAN rationale [16] and the Relativistic GAN [19], our generator $G$ and discriminator $D$ are involved in a min-max game, where $G$ tries to maximize the objective of Eq. (6) by generating realistic images to fool the discriminator, while $D$ tries to minimize it by correctly classifying real images as more realistic than fake ones and generated images as less realistic than real ones.
Expression Loss To make $G$ consistent in accurately transferring target deformations $S_{exp}(p_{trg})$ to the generated images, we consider the discriminator $D$ to have the role of an inspector. To this end, we backpropagate a mean squared loss between the estimated vector $p_{est}$ of the regression layer of $D$ and the actual vector of expression parameters of an image.
We apply the expression loss both to original images and to generated ones. Similarly to the classification loss of StarGAN [12], we construct separate losses for the two cases. For real images $I_{org}$ we define the loss:

$$L_{exp,D} = \frac{1}{N} \| D_p(I_{org}) - p_{org} \|_2^2, \tag{7}$$

between the estimated and real expression parameters of $I_{org}$, while for the generated images we define the loss:

$$L_{exp,G} = \frac{1}{N} \| D_p(I_{gen}) - p_{trg} \|_2^2, \tag{8}$$

between the estimated and target expression parameters of $I_{gen} = G(I_{org}, p_{trg})$. Consequently, $D$ minimizes $L_{exp,D}$ to accurately regress the expression parameters of real images, while $G$ minimizes $L_{exp,G}$ to generate images with accurate expression according to $D$.
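The two expression losses can be sketched as plain mean squared errors. The sketch below assumes the regression outputs of $D$ are already available as lists of floats; the function names and the toy vectors are hypothetical.

```python
def mse(estimated, reference):
    """Mean squared error between two equally long parameter vectors."""
    return sum((e - r) ** 2 for e, r in zip(estimated, reference)) / len(reference)


def expression_losses(p_est_real, p_org, p_est_gen, p_trg):
    """Return (loss for D on real images, loss for G on generated images).

    p_est_real: parameters regressed by D from a real image, compared to p_org.
    p_est_gen:  parameters regressed by D from a generated image, compared to p_trg.
    """
    loss_d = mse(p_est_real, p_org)
    loss_g = mse(p_est_gen, p_trg)
    return loss_d, loss_g


loss_d, loss_g = expression_losses(
    p_est_real=[0.0, 0.0], p_org=[1.0, -1.0],
    p_est_gen=[0.5, 0.5], p_trg=[0.5, 0.5],
)
```

Keeping the two terms separate means that only real images, whose pseudo-annotations come from 3DMM fitting, shape the regressor, while the generator is judged against the target vector it was conditioned on.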

Image Reconstruction Loss
The adversarial loss of Eq. (6) and the expression losses of Eq. (7) and Eq. (8) would be enough to generate random realistic expressive images which, however, would not preserve the contents of the input image $I_{org}$. To overcome this limitation we adopt a cycle consistency loss [35] for our generator $G$:

$$L_{rec} = \frac{1}{W \times H} \| I_{org} - I_{rec} \|_1, \tag{9}$$

over the vectorized forms of the original image $I_{org}$ and the reconstructed image $I_{rec}$. Note that we obtain image $I_{rec}$ by using the generator twice, first to generate image $I_{gen} = G(I_{org}, p_{trg})$ and then to get the reconstructed $I_{rec} = G(I_{gen}, p_{org})$, conditioning $I_{gen}$ on the parameters $p_{org}$ of the original image.
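The double application of the generator can be illustrated with a toy stand-in. In the sketch below, `toy_generator` is a hypothetical function that merely shifts a flattened image so that its mean equals the conditioning value; it is not the SliderGAN generator, and averaging the L1 distance over pixels is an assumption.

```python
def l1_mean(a, b):
    """Mean absolute difference between two flattened images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_loss(generator, image_org, p_org, p_trg):
    """L1 distance between the original image and its reconstruction.

    The generator is applied twice: once with the target parameters and
    once with the original parameters to come back.
    """
    image_gen = generator(image_org, p_trg)
    image_rec = generator(image_gen, p_org)
    return l1_mean(image_org, image_rec)


def toy_generator(image, p):
    """Hypothetical stand-in: shift the image so that its mean equals p."""
    mean = sum(image) / len(image)
    return [v - mean + p for v in image]


image = [0.0, 1.0, 2.0]  # mean 1.0 plays the role of p_org
loss_consistent = cycle_loss(toy_generator, image, p_org=1.0, p_trg=3.0)
loss_wrong_params = cycle_loss(toy_generator, image, p_org=0.0, p_trg=3.0)
```

The loss vanishes only when the second pass is conditioned on the parameters that truly describe the original image, which is the property the cycle consistency term rewards.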
Image Generation Loss To further boost our generator towards accurately transferring the expression from a vector of parameters to the edited image, we introduce image pairs of the form $\{I^i_{org}, p^i_{org}, I^i_{trg}, p^i_{trg}\}_{i=1}^{L}$ that we automatically generate from neutral images, as described in detail in Section 4.1. We exploit the synthetic pairs of images of the same individuals under different expressions by introducing an image generation loss:

$$L_{gen} = \frac{1}{W \times H} \| I_{trg} - I_{gen} \|_1, \tag{10}$$

where $I_{trg}$ and $I_{gen}$ are images with either neutral or synthetic expression of the same individual. Here, we calculate the L1 loss between the synthetic ground truth image $I_{trg}$ and the image $I_{gen}$ generated by $G$, aiming to boost our generator to accurately transfer the 3D expression $S_{exp}(p_{trg})$ to the edited image.
Identity Loss The image reconstruction loss of Eq. (9) helps to maintain the surroundings between the original and generated images. However, the identity of the face is not always maintained by this loss, as also shown by our ablation study in Section 4.7. To alleviate this issue, we introduce a face recognition loss adopted from ArcFace [14], which models face recognition confidence by an angular distance loss. Particularly, we introduce the loss:

$$L_{id} = 1 - \cos(e_{gen}, e_{org}) = 1 - \frac{e_{gen}^\top e_{org}}{\|e_{gen}\| \, \|e_{org}\|}, \tag{11}$$

where $e_{gen} = F(I_{gen})$ and $e_{org} = F(I_{org})$ are embeddings of $I_{gen}$ and $I_{org}$ respectively, extracted by the face recognition module $F$. According to ArcFace, face verification confidence is higher as the cosine similarity $\cos(e_{gen}, e_{org})$ grows. During training, $G$ is optimized to maintain face identity between $I_{gen}$ and $I_{org}$ by minimizing Eq. (11).
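A minimal sketch of an identity term based on the cosine between two embeddings is given below. It assumes the loss takes the form one minus the cosine of the angle between the embeddings, and that the embeddings have already been extracted by the face recognition module; the function name is hypothetical.

```python
import math


def identity_loss(e_gen, e_org):
    """Return 1 - cos(e_gen, e_org) for two non-zero embedding vectors.

    Assumption: the loss is one minus the cosine of the angle between the
    embeddings, so identical directions give 0 and orthogonal ones give 1.
    """
    dot = sum(a * b for a, b in zip(e_gen, e_org))
    norm_gen = math.sqrt(sum(a * a for a in e_gen))
    norm_org = math.sqrt(sum(b * b for b in e_org))
    return 1.0 - dot / (norm_gen * norm_org)


same_direction = identity_loss([1.0, 0.0], [2.0, 0.0])
orthogonal = identity_loss([1.0, 0.0], [0.0, 3.0])
```

The value depends only on the direction of the embeddings, not on their length, so rescaling an embedding does not change the loss.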
Attention Mask Loss To encourage the generator to produce sparse attention masks $G_m$ that focus on the deformation regions and do not saturate to 1, we employ a sparsity loss $L_{att}$. That is, we calculate and minimize the L1-norm of the produced masks for both the generated and the reconstructed images, defining the loss as:

$$L_{att} = \frac{1}{W \times H} \big( \| G_m(I_{org}, p_{trg}) \|_1 + \| G_m(I_{gen}, p_{org}) \|_1 \big). \tag{12}$$

Total Training Loss We combine the loss functions of Eq. (6) to Eq. (12) to form the loss functions $L_G$ and $L_D$ for separately training the generator $G$ and the discriminator $D$ of our model (Eq. (13)), where $\lambda_{exp}$, $\lambda_{rec}$, $\lambda_{gen}$, $\lambda_{id}$ and $\lambda_{att}$ are parameters that regularize the importance of each term in the total loss function. We discuss the choice of those parameters in Section 3.2.
As can be noticed in Eq. (13), we employ different loss functions $L_G$, depending on whether the training data are the real data with no image pairs or the synthetic data which include pairs. The only difference is that in the case of paired data we use the additional supervised loss term $L_{gen}$.
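The paired and unpaired cases can be sketched as one weighted sum with an optional term. This is an illustration only: the dictionary keys, the toy values and the function name are hypothetical, and the sign and weighting of the adversarial term are left to the caller because they are an assumption here rather than something fixed by the text.

```python
def generator_loss(terms, weights, paired):
    """Weighted sum of generator loss terms.

    terms, weights: dicts keyed by "adv", "exp", "rec", "id", "att" and "gen".
    The image generation term "gen" is included only for paired (synthetic)
    data; for unpaired data it is ignored.
    """
    names = ["adv", "exp", "rec", "id", "att"]
    if paired:
        names.append("gen")
    return sum(weights[name] * terms[name] for name in names)


# Toy values chosen so that the sums are easy to check by hand.
terms = {"adv": 1.0, "exp": 1.0, "rec": 1.0, "id": 1.0, "att": 1.0, "gen": 1.0}
weights = {"adv": 1.0, "exp": 2.0, "rec": 3.0, "id": 5.0, "att": 6.0, "gen": 4.0}
loss_unpaired = generator_loss(terms, weights, paired=False)
loss_paired = generator_loss(terms, weights, paired=True)
```

The two results differ exactly by the weighted image generation term, mirroring the statement that the supervised term is the only difference between the two cases.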

Implementation and training details
Having presented the architecture of our model, here we report further implementation and training details. For the generator module $G$ of SliderGAN, we adopted the architecture of CycleGAN [35], as it has proved to generate remarkable results in image-to-image translation problems, as for example in StarGAN [12]. We extended the generator by adding a parallel output layer to accommodate the attention mask mechanism. Moreover, for $D$ we adopted the architecture of PatchGAN [18], which produces probability distributions over multiple image patches being real or generated, $D(I) \rightarrow D_{img}$. As described in Section 3.1, we extended this discriminator architecture by adding a parallel regression layer to estimate continuous expression parameters.
We trained our model with images of size 128 × 128, aligned to a reference shape of 2D landmarks. As condition vectors for our experiments, we utilized the 30 most significant expression components of 4DFAB and the 10 most significant speech components of LRW-3D [30]. We set the batch size to 16 and trained our model for 60 epochs with Adam [20] ($\beta_1 = 0.5$, $\beta_2 = 0.999$). We first trained our model only with the generated image pairs for 20 epochs and then proceeded to unsupervised training for another 40 epochs with unpaired images. Lastly, we chose loss weights $\lambda_{adv} = 30$, $\lambda_{exp} = 1000$, $\lambda_{rec} = 10$, $\lambda_{gen} = 10$, $\lambda_{id} = 4$ and $\lambda_{att} = 0.3$. Larger values for $\lambda_{id}$ significantly restrict $G$, driving it to generate images very close to the original ones with no change in expression. Also, lower values for $\lambda_{att}$ lead to mask saturation.
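For reference, the reported settings can be collected in a small configuration sketch. The numeric values are the ones stated above; the dictionary layout, the key names and the helper `training_phase` (with 0-based epoch indices) are assumptions made for illustration.

```python
# Loss weights as reported in the text.
LOSS_WEIGHTS = {
    "adv": 30.0, "exp": 1000.0, "rec": 10.0,
    "gen": 10.0, "id": 4.0, "att": 0.3,
}

# Training settings as reported in the text.
TRAINING = {
    "image_size": 128,
    "batch_size": 16,
    "epochs": 60,
    "adam_betas": (0.5, 0.999),
    "paired_epochs": 20,  # epochs trained on generated image pairs only
    "n_expression_params": 30,
    "n_speech_params": 10,
}


def training_phase(epoch):
    """Return "paired" for the first 20 epochs (0-based), else "unpaired"."""
    if not 0 <= epoch < TRAINING["epochs"]:
        raise ValueError("epoch out of range")
    return "paired" if epoch < TRAINING["paired_epochs"] else "unpaired"


phases = [training_phase(e) for e in range(TRAINING["epochs"])]
```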

Experiments
In this section we present a series of experiments that we conducted in order to evaluate the performance of SliderGAN. First, we describe the datasets we utilized to train and test our model (Section 4.1). Then, we test the ability of SliderGAN to manipulate the expression in images by adjusting a single or multiple parameters of a 3D blendshape model (Section 4.2). Moreover, we present our results in direct expression transfer between an input and a target image (Section 4.3) and in discrete expression synthesis (Section 4.4). We examine the ability of SliderGAN to handle face deformations due to speech (Section 4.5) and test the regression accuracy of our model's discriminator (Section 4.6). We close the experimental section of our work by presenting an ablation study on the contribution of the different loss functions of our technique (Section 4.7).

Datasets
EmotioNet For the training and validation phases of our algorithm we utilized a subset of 250,000 images of the EmotioNet database [5], which contains over 1 million images of expression and emotion, accompanied by annotations about facial Action Units. However, SliderGAN is trained with pairs of images and blendshape parameters, which are not available. Therefore, in order to extract the expression parameters, we fit the 3DMM of [9] to each image of the dataset in use. To ensure the high quality of 3D reconstruction, we employed the LSFM [10] identity model concatenated with the expression model of 4DFAB [11]. The 4DFAB expression model was built from a collection of over 10,000 expressive 3D face scans of spontaneous and posed expressions, collected from 180 individuals in 4 sessions over a period of 5 years. SliderGAN exploits the scale and representation power of 4DFAB to learn how to realistically edit facial expressions in images. The method described above constitutes a technique to automatically annotate the dataset and eliminates the need for costly manual annotation.
3D Warped Images One crucial problem of training with pseudo-annotations extracted by 3DMM fitting on images is that the parameter values are not always consistent, as small variations in expression can be mistakenly explained by the identity, texture or camera model of the 3DMM. To overcome this limitation, we augment the training dataset with expressive images that we render ourselves and for which we therefore know the exact blendshape parameter values. In more detail, we fit 10,000 images of EmotioNet with the same 3DMM in order to recover the identity and camera models for each image. A 3D texture can also be sampled by projecting the recovered mesh onto the original image. Then, we combined the identity meshes with randomly generated expressions from the 4DFAB expression model and rendered them back on the original images. Rendering 20 different expressions from each image, we augmented the dataset by 200,000 accurately annotated images. Some of the generated images are displayed in Fig. 5.
4DFAB Images A common problem of developing generative models of facial expression is the difficulty in accurately measuring the quality of the generated images. This is mainly due to the lack of databases with images of people of the same identity with arbitrary expressions. To overcome this issue and quantitatively measure the quality of images generated by SliderGAN, as well as compare with the baseline, we created a database with rendered images from 3D meshes and textures of 4DFAB. In more detail, we rendered 100 to 500 images with arbitrary expression from each of the 180 identities and for each of the 4 sessions of 4DFAB, thus rendering 300,000 images in total. To obtain expression parameters for each rendered image, we projected each corresponding 3D mesh $S$ onto the blendshape model $S_{exp}$, such that the obtained parameters are $p = U_{exp}^\top (S - \bar{s})$.
Lip Reading Words in 3D (LRW-3D) The Lip Reading in the Wild (LRW) dataset [13] consists of videos of hundreds of speakers including up to 1000 utterances of 500 different words. LRW-3D [30] provides speech blendshape parameters for the frames of LRW, which were recovered by mapping each frame of LRW that corresponds to one of the 500 words to instances of a 3D blendshape model of speech, by aligning the audio segments of the LRW videos and those of a 4D speech database. Moreover, to extract expression parameters for each word segment of the videos, we applied the 3DMM video fitting algorithm of [9], which accounts for the temporal dependency between frames. In Section 4.5, we utilize the annotations of LRW-3D as well as the expression parameters to perform expression and speech transfer.
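The projection used to annotate the rendered 4DFAB images can be sketched as follows. The sketch assumes that the expression basis has orthonormal columns, so that projecting amounts to multiplying the mean-centred mesh by the transposed basis; the function name and the toy basis are hypothetical.

```python
def project_to_blendshapes(shape, mean, U_exp):
    """Compute p = U_exp^T (shape - mean) with plain Python lists.

    shape, mean: lists of 3n floats; U_exp: list of 3n rows, one column per
    blendshape component. Assumes the columns of U_exp are orthonormal.
    """
    centered = [s - m for s, m in zip(shape, mean)]
    n_components = len(U_exp[0])
    return [
        sum(U_exp[row][k] * centered[row] for row in range(len(centered)))
        for k in range(n_components)
    ]


# Toy model (hypothetical): one vertex and two orthonormal components.
mean = [1.0, 1.0, 1.0]
U_exp = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
params = project_to_blendshapes([1.5, 0.0, 1.0], mean, U_exp)
```

Any part of the mesh that lies outside the span of the basis, such as the third coordinate in the toy example, is simply not represented in the recovered parameters.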

3D Model-based Expression Editing
Sliding single expression parameters In this experiment we demonstrate the capability of SliderGAN to edit the facial expression of images when single expression parameters are slid within the normalized range [-1, 1]. In Fig. 6 we provide results for 11 levels of activation of single parameters of the model (-1, -0.8, -0.6, -0.4, -0.2, 0, 0.2, 0.4, 0.6, 0.8, 1), while the remaining parameters stay at zero. As can be observed in Fig. 6, SliderGAN successfully learns to reproduce the behaviour of each blendshape separately, producing realistic facial expressions while maintaining the identity of the input image. Also, the transition between the generated expressions is smooth for successive values of the same parameter, and the intensity of the expressions depends on the magnitude of the parameter value. Note that when the zero vector is applied, SliderGAN produces the neutral expression, whatever the expression of the original image.
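The condition vectors for this experiment can be sketched as follows: one parameter takes the slider values listed above (from -1 to 1 in steps of 0.2) while all other entries stay at zero. The function name and its arguments are hypothetical.

```python
def slider_vectors(n_params, index, levels=11):
    """Return one parameter vector per slider position for a single component.

    The chosen component sweeps evenly from -1 to 1; all others remain 0.
    """
    step = 2.0 / (levels - 1)
    vectors = []
    for k in range(levels):
        # Rounding removes floating-point noise such as -0.3999999999999999.
        value = round(-1.0 + k * step, 10)
        p = [0.0] * n_params
        p[index] = value
        vectors.append(p)
    return vectors


vectors = slider_vectors(n_params=30, index=2)
slid_values = [p[2] for p in vectors]
```

Each vector would then be passed to the generator together with the same input image; the middle vector is the zero vector, which corresponds to the neutral expression.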

Sliding multiple expression parameters
The main feature of SliderGAN is its ability to edit facial expressions in images by sliding multiple parameters of the model, similarly to sliding parameters in a blendshape model to generate new expressions of a 3D face mesh. To test this characteristic of our model, we synthesize random expressions by conditioning the generator input on parameter vectors with elements randomly drawn from the standard normal distribution. Note that the model was trained with expression parameters normalized by the square root of the eigenvalues $e_i$, $i = 1, ..., N$ of the PCA blendshape model. This means that all combinations of expression parameters within the range [-1, 1] correspond to feasible facial expressions.
As illustrated by Fig. 7, SliderGAN is able to synthesize face images with a great variability of expressions, while maintaining identity. The generated expressions accurately resemble the expressions of the 3D meshes when the same vector of parameters is used for the blendshape model. This fact makes our model ideal for facial expression editing in images. A target expression can first be chosen by utilizing the ease of perception of the 3D visualization of a 3D blendshape model and then, the target parameters can be employed by the generator to edit a face image accordingly.
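A minimal sketch of the normalization and of the random sampling described above is given below. It assumes that normalizing means dividing each raw parameter by the square root of its eigenvalue; the function names, the toy eigenvalues and the fixed seed are hypothetical.

```python
import math
import random


def normalize_params(raw, eigenvalues):
    """Divide each raw parameter by the square root of its eigenvalue."""
    return [p / math.sqrt(e) for p, e in zip(raw, eigenvalues)]


def denormalize_params(normalized, eigenvalues):
    """Inverse of normalize_params, e.g. for driving the 3D blendshape model."""
    return [p * math.sqrt(e) for p, e in zip(normalized, eigenvalues)]


def sample_expression(n_params, rng):
    """Draw a random condition vector from the standard normal distribution."""
    return [rng.gauss(0.0, 1.0) for _ in range(n_params)]


eigenvalues = [4.0, 9.0]  # toy values
normalized = normalize_params([2.0, 3.0], eigenvalues)
restored = denormalize_params(normalized, eigenvalues)
random_vector = sample_expression(30, random.Random(0))
```

After this normalization, a value of 1 on any slider corresponds to one standard deviation along that component, which makes the sliders comparable across components.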

Expression Transfer and Interpolation
A by-product of SliderGAN is that the discriminator $D$ learns to map images to expression parameters $D_p$ that represent their 3D expression through $S_{exp}(D_p)$. We capitalize on this fact to perform direct expression transfer and interpolation between images without any annotations about expression. Assuming a source image $I_{src}$ with expression parameters $p_{src} = D_p(I_{src})$ and a target image $I_{trg}$ with expression parameters $p_{trg} = D_p(I_{trg})$, we are able to transfer expression $p_{trg}$ to image $I_{src}$ by utilising the generator of SliderGAN, such that $I_{src \rightarrow trg} = G(I_{src} | p_{trg})$. Note that no 3DMM fitting or manual annotation is required to extract the expression parameters and transfer the expression, as this is performed by the trained discriminator.
Additionally, by interpolating the expression parameters of the source and target images, we are able to generate expressive faces that demonstrate a smooth transition from expression $p_{src}$ to expression $p_{trg}$. Interpolation of the expression parameters can be performed by sliding an interpolation factor $a$ within the range [0, 1], such that the requested parameters are $p_{interp} = a\, p_{src} + (1 - a)\, p_{trg}$.
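The interpolation rule above can be sketched directly. The function name is hypothetical; the parameter vectors are assumed to have been regressed beforehand from the source and target images.

```python
def interpolate(p_src, p_trg, a):
    """Return a * p_src + (1 - a) * p_trg for an interpolation factor a in [0, 1]."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("interpolation factor must lie in [0, 1]")
    return [a * s + (1.0 - a) * t for s, t in zip(p_src, p_trg)]


p_src = [1.0, 0.0]
p_trg = [0.0, 1.0]
at_source = interpolate(p_src, p_trg, 1.0)
halfway = interpolate(p_src, p_trg, 0.5)
at_target = interpolate(p_src, p_trg, 0.0)
```

With the formula as written, $a = 1$ reproduces the source expression and $a = 0$ the target expression, so an animation from source to target slides $a$ from 1 down to 0.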

Qualitative Evaluation Results of performing expression transfer and interpolation on images of the 4DFAB rendered database and EmotioNet are displayed in Fig. 8 and Fig. 9 respectively, where it can be seen that the expressions of the generated images closely reproduce the target expressions. The smooth transition between expressions $p_{src}$ and $p_{trg}$ indicates that SliderGAN successfully learns to map images to expressions across the whole expression parameter space. Also, by observing the recovered 3D faces, it is evident that $D$ accurately regresses the blendshape parameters from images $I_{trg}$. The accuracy of the regressed parameters is also examined in Section 4.6.
To further validate the quality of our results, we trained GANimation on the same dataset with AU annotations extracted with OpenFace [2], as suggested by the authors. We performed expression transfer between images and present results for SliderGAN-RaD, SliderGAN-WGP and GANimation. In Fig. 10, it can be seen that SliderGAN-RaD benefits from the Relativistic GAN training and produces higher quality textures than SliderGAN-WGP, while both SliderGAN implementations simulate the expressions of the target images better than GANimation.
Quantitative Evaluation In this section we provide a quantitative evaluation of the performance of SliderGAN on arbitrary expression transfer. We employ the 4DFAB rendered images dataset, which allows us to calculate the Image Euclidean Distance (IED) [31] between ground truth rendered images of 4DFAB and images generated by SliderGAN. The Image Euclidean Distance is a robust alternative to the standard pixel loss for image distances, which is defined between two RGB images $x$ and $y$, each with $M \times N$ pixels, as:

$$IED(x, y) = \frac{1}{2\pi} \sum_{i=1}^{MN} \sum_{j=1}^{MN} \exp\left\{ -\frac{|P_i - P_j|^2}{2} \right\} \|x_i - y_i\|_2 \, \|x_j - y_j\|_2, \tag{14}$$

where $P_i$ and $P_j$ are the pixel locations on the 2D image plane and $x_i$, $y_i$, $x_j$, $y_j$ the RGB values of images $x$ and $y$ at the vectorized locations $i$ and $j$.
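The following sketch computes a distance of this kind for very small images, to make the role of the pixel locations concrete. It rests on several assumptions that should be checked against [31]: a Gaussian spatial kernel with unit width, a normalization by $2\pi$, and the reduction of each RGB difference to its Euclidean norm. The function name is hypothetical, and the double loop is quadratic in the number of pixels, so it is only meant for illustration.

```python
import math


def image_euclidean_distance(x, y, width):
    """Spatially weighted distance between two RGB images.

    x, y: lists of (r, g, b) tuples in row-major order; width: pixels per row.
    Assumptions: unit-width Gaussian kernel on pixel locations, and each RGB
    difference is reduced to its Euclidean norm.
    """
    diffs = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(px, py)))
        for px, py in zip(x, y)
    ]
    total = 0.0
    for i, di in enumerate(diffs):
        if di == 0.0:
            continue  # pixels that agree contribute nothing
        row_i, col_i = divmod(i, width)
        for j, dj in enumerate(diffs):
            row_j, col_j = divmod(j, width)
            dist2 = (row_i - row_j) ** 2 + (col_i - col_j) ** 2
            total += math.exp(-dist2 / 2.0) * di * dj
    return total / (2.0 * math.pi)


black = [(0.0, 0.0, 0.0)] * 4              # 2 x 2 image
one_pixel_off = [(3.0, 4.0, 0.0)] + black[1:]
d_same = image_euclidean_distance(black, black, width=2)
d_diff = image_euclidean_distance(black, one_pixel_off, width=2)
```

Unlike a plain pixel loss, differences at nearby locations reinforce each other through the spatial kernel, which makes the measure less sensitive to small misalignments.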
We trained SliderGAN with the rendered images from 150 identities of 4DFAB, leaving 30 identities for testing.To allow direct comparison between generated and real images, we randomly created 10,000 pairs of images of the same session and identity (this ensures that the images were rendered with the same camera conditions) from the testing set and performed expression transfer within each pair.To compare our model against the baseline model GANimation, we trained and performed the same experiment using GANimation on the same dataset with AUs activations that we obtained with OpenFace.Also, to showcase the bene-

Fig. 8 Expression interpolation between images of 4DFAB. First, we employ D to recover the expression parameters from an input and the target images. Then, we capitalize on these parameter vectors to animate the expression of the input image towards multiple targets.

Fig. 9 Expression interpolation between images of Emotionet. First, we employ D to recover the expression parameters from an input and the target images. Then, we capitalize on these parameter vectors to animate the expression of the input image towards multiple targets.

Fig. 10 Expression transfer between images of Emotionet. First, we employ D to recover expression parameters from the target images. Then, we utilize these parameter vectors to transfer the target expressions to the input images. From the results, SliderGAN-RaD produces higher quality textures than either of the other two methods (mostly evident in the mouth and eye regions). Moreover, GANimation reproduces the target expressions with lower accuracy. (Please zoom in on the images to notice the differences in texture quality.)

The results are presented in Table 1, where it can be seen that SliderGAN-RaD produces images with the lowest IED.

Synthesis of Discrete Expressions
Specific combinations of the 3D expression model parameters represent the discrete expressions anger, contempt, fear, disgust, happiness, sadness, surprise and neutral. We employ these parameter vectors to synthesize expressive face images of the aforementioned discrete expressions and test our results both qualitatively and quantitatively.
Qualitative Evaluation To evaluate the performance of SliderGAN in this task, we visually compare our results against the results of five baseline models: DIAT [21], CycleGAN [35], IcGAN [24], StarGAN [12] and GANimation [25]. In Fig. 11 it is evident that SliderGAN generates results that resemble the queried expressions while maintaining the original face's identity and resolution. The results are close to those of GANimation; however, the Relativistic GAN training of SliderGAN allows for slightly higher image quality.
The neutral expression can also be synthesized by SliderGAN when all the elements of the target parameter vector are set to 0. In fact, the neutral expression of the 3D blendshape model is also synthesized by the same vector. Results of image neutralization on in-the-wild images of arbitrary expression are presented in Fig. 12, where it can be observed that the neutral expression is generated without significant loss of the faces' identity.
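The reason the zero vector yields the neutral expression in the 3D model can be seen from a linear blendshape model, in which a face is obtained by adding the weighted components to a mean face. The sketch below illustrates this with a tiny hypothetical model; the function name `blendshape_face`, the 4-coordinate "mesh" and the two components are invented for illustration and are not the model used in the paper.

```python
from typing import List, Sequence


def blendshape_face(mean_face: Sequence[float],
                    components: Sequence[Sequence[float]],
                    params: Sequence[float]) -> List[float]:
    """Linear blendshape model: mean face plus the weighted sum of components."""
    if len(components) != len(params):
        raise ValueError("one parameter is needed per component")
    face = list(mean_face)
    for weight, component in zip(params, components):
        if len(component) != len(face):
            raise ValueError("component and mean face sizes differ")
        for idx, offset in enumerate(component):
            face[idx] += weight * offset
    return face


# Hypothetical toy model: 4 coordinates, 2 blendshape components.
mean_face = [0.0, 1.0, 2.0, 3.0]
components = [[1.0, 0.0, -1.0, 0.0],
              [0.0, 0.5, 0.0, -0.5]]

neutral = blendshape_face(mean_face, components, [0.0, 0.0])      # all sliders at zero
expressive = blendshape_face(mean_face, components, [1.0, -1.0])  # sliders moved
```

With all parameters at zero every component is cancelled and only the mean (neutral) face remains, which is why the same zero vector serves as the neutralization target for the image generator.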
Quantitative Evaluation We further evaluate the quality of the generated expressions by performing expression recognition with the off-the-shelf recognition system [22]. In more detail, we randomly selected 10,000 images from the test set of Emotionet, translated them to each of the discrete expressions anger, disgust, fear, happiness, sadness, surprise and neutral, and passed them to the expression recognition network. For comparison, we repeated the same experiment with SliderGAN-WGP and GANimation using the same image set. In Table 2 we report accuracy scores for each expression class separately, as well as the average accuracy score for the three methods. The classification results are similar for the three models, with both implementations of SliderGAN producing slightly higher scores, which indicates that GANimation's results include more failure cases.
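The per-class and average accuracy scores described above can be computed as in the following sketch. It assumes that each generated image is stored as a pair of the queried expression and the label predicted by the recognition network, and that the average is the mean of the per-class accuracies; the function name and the toy predictions are hypothetical and do not reproduce any reported result.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def per_class_accuracy(pairs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Map each queried expression to the fraction of images recognised as it."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for target, predicted in pairs:
        total[target] += 1
        if predicted == target:
            correct[target] += 1
    return {label: correct[label] / total[label] for label in total}


# Toy (queried expression, recognised expression) pairs; hypothetical values.
results = [
    ("happiness", "happiness"), ("happiness", "surprise"),
    ("anger", "anger"), ("anger", "anger"),
    ("neutral", "sadness"), ("neutral", "neutral"),
]
accuracy = per_class_accuracy(results)
average_accuracy = sum(accuracy.values()) / len(accuracy)
```

When every class contains the same number of images, as in the experiment above, the mean of the per-class accuracies coincides with the overall accuracy.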

Combined Expression and Speech Synthesis and Transfer
Blendshape coding of facial deformations allows modelling arbitrary deformations (e.g. deformations due to identity, speech, non-human face morphing etc.) that are not limited to facial expressions, unlike AU coding, which is a system that taxonomizes the human facial muscles [15]. Even though AUs 10-28 model mouth and lip motion, not all the details of the lip motion that takes place during speech can be captured by these AUs. Moreover, only 10 (10, 12, 14, 15, 17, 20, 23, 25, 26, 28) out of these 18 AUs can be recognized automatically, and only with low accuracy. On the contrary, a blendshape model of the 3D motion of the human mouth and lips would better capture motion during speech, while it would allow the recovery of robust representations from images and videos of human speech.
We capitalize on this fact and employ the mouth and lips blendshape model of [30] to perform speech synthesis from a single image with SliderGAN. In particular, we employ the LRW-3D database, which contains speech blendshape parameter annotations for the 500 words of LRW [13], to perform combined expression and speech synthesis and transfer, which we evaluate both qualitatively and quantitatively.
Qualitative Evaluation LRW contains videos with both expression and speech. Thus, to completely capture the smooth face motion across frames, we employed 30 expression parameters recovered by 3DMM fitting and 10 speech parameters of LRW-3D, which correspond to the ten most significant components of the 3D speech model. We trained SliderGAN with 180,000 frames of LRW without leveraging the temporal characteristics of the database; that is, we shuffled the frames and trained our model with random target vectors to avoid learning person-specific deformations. Results of performing expression and speech synthesis from a video using a single image are presented in Fig. 13, where the parameters and the input frame belong to the same video (ground truth frames are available), and in Fig. 14, where the parameters and the input frame belong to different videos of LRW.

Fig. 11 By comparing SliderGAN against DIAT [21], CycleGAN [35], IcGAN [24], StarGAN [12] and GANimation [25], we observe that our model generates results of high texture quality that resemble the queried expressions. The results of the rest of the methods were taken from [25].
For comparison, we trained GANimation on the same dataset with AU activations obtained by OpenFace. As can be seen in Fig. 13 and Fig. 14, GANimation is not able to accurately simulate the lip motion of the target video. On the contrary, SliderGAN-WGP simulates mouth and lip motion well, but produces textures that look less realistic. SliderGAN-RaD produces higher quality results that look realistic in terms of both deformation accuracy and texture.
Quantitative Evaluation To measure the performance of our model we employ Image Euclidean Distance (IED) [31] to evaluate the results of expression and speech synthesis when the input frame and target parameters belong to the same video sequence.Due to changes in pose in the target videos, we align all tar-

The results are presented in Table 3, where it can be seen that SliderGAN-RaD achieves the lowest error.

3D Expression Reconstruction
As also described in Section 4.3, a by-product of SliderGAN is the discriminator's ability to map images to expression parameters D_p that reconstruct the 3D expression as S_exp(D_p). We test the accuracy of the regressed parameters on images of Emotionet in two scenarios: a) we calculate the error between parameters recovered by 3DMM fitting and those regressed by D on the same image (Table 4, row 1), and b) we test the consistency of our model and calculate the error between some target parameters p_trg and those regressed by D on a manipulated image which was translated to expression p_trg by SliderGAN-RaD (Table 4, row 2).
For comparison, we repeated the same experiment with GANimation, for which we calculated the errors in AU activations. For both experiments we employed 10,000 images from our test set. The results demonstrate that the discriminator of SliderGAN-RaD extracts expression parameters from images with high accuracy compared to 3DMM fitting. On the contrary, GANimation's discriminator is less consistent in recovering AU annotations when compared to those of OpenFace. This also illustrates that the robustness of blendshape coding of expression over AUs makes SliderGAN more suitable than GANimation for direct expression transfer.

Ablation Study
In this section we investigate the effect of the different losses that constitute the total loss functions L G and L D of our algorithm.As discussed in Section 3.1, both training in a semi-supervised manner with loss L gen and employing a face recognition loss L id between the original and the generated images, contribute signifi-

Fig. 14 Comparison of combined expression and speech animation from a single input image between GANimation [25], SliderGAN-WGP and SliderGAN-RaD. We utilize as targets the expression and speech blendshape parameters of consecutive frames of a video of LRW. Then we reconstruct the expression and speech from a single input frame of the same video. Both SliderGAN implementations reconstruct face motion more accurately than GANimation. Also, the texture quality of the results is higher in SliderGAN-RaD than in SliderGAN-WGP, as expected. (Please zoom in on the images to notice the differences in texture quality.)

Table 4 Expression representation results on SliderGAN-RaD (blendshape parameter coding) and GANimation (AU activation coding). SliderGAN is able to accurately and robustly recover expression representations, while GANimation fails to detect AU activations.

learning process is a very powerful regression network that maps the image into a number of blendshape parameters, which can then be used for conditioning the inputs of the generator.

Fig. 1 Expressive faces generated by sliding a single or multiple blendshape parameters in the normalized range [−1, 1]. Rows 1 and 3 depict 3D expressive faces generated by a linear blendshape model of natural face motion and a set of expression parameters. The corresponding edited images generated by SliderGAN using the same set of parameters are depicted in rows 2 and 4. As can be observed, the generated images accurately replicate the 3D faces' motion. The robustness of blendshape coding of facial motion allows SliderGAN to perform speech synthesis, as demonstrated in rows 5 (target speech) and 6 (synthesized speech), for which a 3D blendshape model of human speech was utilized.

Fig. 2 Visualization of the 5 most significant components of the blendshape model S_exp. The 3D faces of this figure have been generated by adding the multiplied components to a mean face.

Fig. 3 Examples of the 3D representation of the expression of an image by the model S_exp. The 3D faces of this figure have been generated by 3DMM fitting on the corresponding images.

Fig. 4 Synopsis of the modules, losses and the training process of SliderGAN. An attention-based generator G is trained to generate realistic expressive faces from continuous parameters by employing a set of adversarial, generation, reconstruction, identity and attention losses. The performance of our model is significantly boosted by employing synthetic image pairs through the L_gen loss. Moreover, a relativistic discriminator D is trained to classify images as relatively more real or fake, as well as to regress expression parameters of the input images in order to increase the generation quality of G.

Fig. 5 Synthetic expressive faces, generated by fitting a 3DMM on the original images and rendering back with a randomly sampled expression.The images with a red frame are the original images.

Fig. 6 Expressive faces generated by sliding single blendshape (b/s) parameters in the range [−1, 1]. As can be observed, the edited images accurately replicate the 3D faces' motion in the whole range of parameter values.

Fig. 7 Expressive faces generated by sliding multiple blendshape (b/s) parameters in the range [−1, 1]. As can be observed, the wide range of edited images accurately replicates the 3D faces' motion.

Fig. 12 Neutralization of in-the-wild images of arbitrary expression. The neutralization takes place by setting all blendshape parameter values to zero.

Fig. 13 Combined expression and speech animation from a single input image. We utilize as targets the expression and speech blendshape parameters of consecutive frames of videos of LRW, to synthesize sequences of expression and speech from a single input image.

5 Conclusion

Fig. 15 Results from the ablation study on SliderGAN's loss function components. It is evident that both losses L_id and L_gen have significant impact on the training of the model, with L_id being the most important for generating realistic images.

Table 1 Image Euclidean Distance (IED), calculated between ground truth images of 4DFAB and the corresponding images generated by GANimation [25], SliderGAN-WGP and SliderGAN-RaD. SliderGAN-RaD produces the lowest IED among the three methods.